MathOverflow generates trackbacks to the arXiv. This means that whenever someone mentions or links to an arXiv paper, we notify the arXiv of this, and they generate a link back to the MathOverflow post. In fact, the arXiv supports trackbacks from a wide variety of sources, including a number of maths blogs. Nevertheless, MathOverflow generates a rather large fraction of the total trackbacks!
You can see a list of recent trackbacks at <http://arxiv.org/tb/recent>.
The trackbacks are generated by an external script that checks for new or modified posts on MathOverflow, looks through the content for links to the arXiv or arXiv identifiers, and then notifies the arXiv of new links. This all happens independently of StackExchange, and maintenance of the system is the responsibility of the moderators. The purpose of this post is to document this system.
The code that generates trackbacks was originally written by Anton and Scott M, and then partially rewritten by Scott M after the transition to 2.0. It is hosted in the MathOverflow Mercurial repository, under the trackbacks/ folder.
The main entry point is the script trackback.cron.sh. This should be run at regular intervals (we currently do it hourly), but it is quite robust and can be run more or less frequently, and should cope even with long delays. We currently run this script via a cron job on a machine running on Amazon EC2 (maybe more about that later).
The script itself has two essentially separate components, and two pieces of state.
The state consists of a file timestamp.txt, containing the date of the last successful completion of the script, and a file trackbacks containing a list of all previously posted trackbacks. Both of these files are tracked in the Mercurial repository, so moving to a new machine should be as simple as committing and fetching the repository.
The script is written in an amalgam of bash and python.
The first component of the script, split between make-checklist.py and scrape-page.py, queries MathOverflow via the API to find the content of all questions, answers and comments modified since the last successful completion timestamp. (If this script has not run for a while, this can be time-consuming, as it requires respecting the throttles requested in the API responses.) It then searches this content for either links to the arXiv or arXiv identifiers. We do this using a series of regular expressions, which allow users to use a variety of different syntaxes! We invested a reasonable amount of time getting this right, going by hand through the database until our regexes seemed to be catching almost everything. If anyone ever needs to extract arXiv identifiers from text, findArxivIds.py is the place to look. Finally we write out the triple
<post-id> <arxiv-id> <post-title>
for each found link, to the temporary file newtrackbacks.
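To give a flavour of the identifier matching, here is a minimal sketch in Python. It is not the code in findArxivIds.py: the patterns, the function names `find_arxiv_ids` and `format_triple`, and the choice to strip version suffixes are all assumptions made for illustration, and the real regular expressions cover more syntaxes than these two.

```python
import re

# Hypothetical patterns for illustration; the real ones live in findArxivIds.py.
# New-style identifiers: YYMM.NNNN or YYMM.NNNNN, with an optional version suffix.
NEW_STYLE = r"\d{4}\.\d{4,5}(?:v\d+)?(?!\d)"
# Old-style identifiers: archive[.SUBJECT]/YYMMNNN, e.g. math.AG/0601001.
OLD_STYLE = r"[a-z][a-z\-]*(?:\.[A-Z]{2})?/\d{7}(?:v\d+)?(?!\d)"

# An identifier counts only if it follows an arXiv URL prefix or "arXiv:".
ARXIV_REFERENCE = re.compile(
    r"(?:arxiv\.org/(?:abs|pdf)/|arxiv:\s*)(" + NEW_STYLE + "|" + OLD_STYLE + ")",
    re.IGNORECASE,
)


def find_arxiv_ids(text):
    """Return the arXiv identifiers referenced in text, in order, without duplicates."""
    found = []
    for match in ARXIV_REFERENCE.finditer(text):
        # Drop a trailing version such as "v2" (an assumption of this sketch).
        arxiv_id = re.sub(r"v\d+$", "", match.group(1))
        if arxiv_id not in found:
            found.append(arxiv_id)
    return found


def format_triple(post_id, arxiv_id, title):
    """Format one '<post-id> <arxiv-id> <post-title>' line."""
    return f"{post_id} {arxiv_id} {title}"
```

Calling `find_arxiv_ids` on the body of a post and passing each result to `format_triple` would produce lines in the shape described above, assuming the three fields are separated by single spaces.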
The second component of the script, ./send-new-trackback.sh, then reads this output from newtrackbacks, filters out any links which have already been posted according to the state file trackbacks, posts all the new links to the arXiv, and adds them to the trackbacks file.
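The filtering step can be sketched as follows. This is an illustration in Python rather than the contents of send-new-trackback.sh; it assumes that the trackbacks state file uses the same `<post-id> <arxiv-id> <post-title>` line format as newtrackbacks, and that a link counts as already posted when its post id and arXiv id both match, whatever the title. The names `parse_line` and `select_unposted` are hypothetical.

```python
def parse_line(line):
    """Split a '<post-id> <arxiv-id> <post-title>' line into its three fields."""
    parts = line.rstrip("\n").split(" ", 2)
    while len(parts) < 3:
        parts.append("")  # tolerate a missing title
    return tuple(parts)


def select_unposted(new_lines, posted_lines):
    """Return parsed entries from new_lines whose (post-id, arxiv-id) pair is new.

    Assumption of this sketch: the title is ignored when comparing, so an
    edited post title does not cause a second trackback for the same link.
    """
    seen = {parse_line(line)[:2] for line in posted_lines if line.strip()}
    fresh = []
    for line in new_lines:
        if not line.strip():
            continue
        entry = parse_line(line)
        if entry[:2] not in seen:
            seen.add(entry[:2])  # also suppress duplicates within new_lines
            fresh.append(entry)
    return fresh
```

Each entry returned by `select_unposted` would then be posted to the arXiv and appended to the trackbacks file; the posting itself is left out of this sketch.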
Finally, the main script runs the two components in turn and, if they both complete successfully, updates the timestamp in timestamp.txt.
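The control flow of the main script amounts to "run each step, and only touch the timestamp if nothing failed". Here is a minimal Python sketch of that pattern; it is not a transcription of trackback.cron.sh. The function name `run_pipeline`, the Unix-epoch timestamp format, and the decision to record the time the run started are assumptions of this sketch.

```python
import subprocess
import time


def run_pipeline(commands, timestamp_path, now=None):
    """Run each command in turn; update timestamp_path only if all succeed."""
    # Assumption: we record the time the run began, so that posts modified
    # while the run is in progress are rechecked next time.
    started = int(time.time()) if now is None else now
    for command in commands:
        result = subprocess.run(command)
        if result.returncode != 0:
            # Leave the old timestamp in place so the next run retries
            # the same window of modified posts.
            return False
    with open(timestamp_path, "w") as handle:
        handle.write(f"{started}\n")
    return True
```

Because the timestamp is left alone after a failure, the next run covers the same period again, and the trackbacks state file prevents anything from being posted twice. This is why the script copes with long gaps between runs.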
It’s possible for this mechanism to miss some links, so we used to also regularly scrape database dumps. Our scripts to do this became obsolete with the transition to 2.0, and no one has fixed them yet.