Automating Wikipedia History

This recent Jon Udell entry about Wikipedia wars mentioned a great idea, but I don’t have the time to code it.

I’d love to see a tool for animating Wikipedia history for a given entry or block of text (see Udell’s screencast for an example). Bonus points for highlighting what changed in each version, and extra special bonus points for a way to scrub backwards and forwards through time. I don’t care if it’s a Greasemonkey extension, Flash or Ajax, as long as it does the job.

Lazyweb, hear my plea! $250 (originally $50) and a free Flickr Pro account to the best implementation, ruthlessly decided by me in about a week. If anyone else wants to kick in money for the bounty, feel free to post a comment. (If your application meets Jason Scott’s criteria in the comments below, you’ll earn an additional $50.)

Update: Two amazing entries have been submitted so far, both using the Greasemonkey extension for Firefox: Dan Phiffer’s Wikipedia Animate and Corey’s WikiDiff. Others are still in development, and a winner will be announced on Tuesday.

June 21, 2005: Two more entries! John Resig’s AniWiki and Colin Hill’s BetterHistory. Also, note that the first two submissions have had big changes… Give them all a try, and stay tuned for the winner later today.

June 27, 2005: The winners!

60 thoughts on “Automating Wikipedia History”

  1. I considered it, but Fundable doesn’t allow you to create a group action without a target amount or set contribution. Ideally, I’d want to allow people to just continue to kick in arbitrary amounts of money until someone decides to claim the bounty. If nobody else contributes, the bounty is still seeded with my $50.

  2. Hey there, Andy.

    You know my opinion on Wikipedia. A lot of it comes from my research, and I am all for a tool like this coming out, ESPECIALLY if it allows everyone to totally break out the tracking of wikipedia work, and be able to tell when information is pulled away. Basically, a Wikipedia client that works to Wikipedia like a graphical IRC client works with IRC. I offer $50 to this goal.

  3. One avenue is to contact the history flow guys at IBM to find out if there is a more granular version. History flow seems to do what you want at a more macro time scale. Being able to navigate that flow seems like a natural add on.

  4. Doing this is fairly straightforward with XMLHttpRequest. You create a JavaScript bookmarklet, and when you visit the article and click on it, it downloads the article history (simply http://en.wikipedia.org/w/index.php?title=Article&action=history), using a regular expression to filter out the URLs of the last 500 edits and storing them in an array. You then concatenate the regex matches into full URLs and loop over the array to download a buffer of old revisions (probably skipping ten at a time), saving them to another array. After enough have been gathered (akin to when your media player buffers so it doesn’t lag), it erases the page and replaces it with the HTML it garnered in the first XMLHttpRequest, and after 1 second it just increments the array index, and so on, while simultaneously downloading the rest.

    I don’t have time to implement this, but if I did I would have an absolutely positioned div across the top, akin in size to the Blogger navbar, and it would contain some basic options, such as the number of articles to buffer before beginning the show. For the articles that I downloaded and saved in the buffer, I would extract only the contents of the div with id="bodyContent". Below my navbar, the entire rest of the document would become this div.

    This is actually quite a lot of hard work. I have lying around here somewhere a python script which downloads the article histories and scrapes them into a format that HistoryFlow recognizes (since the current version seems to have no automatic way of doing this), and that is suitable enough for me, personally.
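The first step described in this comment, scraping revision links out of the history page, might look roughly like the sketch below. It is only an illustration: the function names `extractRevisionUrls` and `sampleEvery` are hypothetical, and it assumes that revision links in the history HTML carry an `oldid=<number>` parameter and that "prev"-style comparison links carry a `diff=` parameter.

```javascript
// Hypothetical helper: pull revision URLs out of the raw HTML of a history page.
// Assumption: each revision link contains "oldid=<number>" in its href.
function extractRevisionUrls(historyHtml, baseUrl) {
  const linkPattern = /href="([^"]*[?&](?:amp;)?oldid=(\d+)[^"]*)"/g;
  const seen = new Set();
  const urls = [];
  let match;
  while ((match = linkPattern.exec(historyHtml)) !== null) {
    const [, href, oldid] = match;
    if (href.includes("diff=")) continue; // skip comparison links
    if (seen.has(oldid)) continue;        // a revision may be linked more than once
    seen.add(oldid);
    // Undo the HTML escaping of "&" and prepend the site root.
    urls.push(baseUrl + href.replace(/&amp;/g, "&"));
  }
  return urls;
}

// Keep every `step`-th item, e.g. to skip ten revisions at a time.
function sampleEvery(items, step) {
  return items.filter((_, index) => index % step === 0);
}
```

The resulting array can then be fed to whatever download loop fills the buffer; as a later comment points out, cross-site restrictions mean the requests themselves may need Greasemonkey or a server-side script rather than a plain bookmarklet.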

  5. Brian: I’ve actually already finished everything you mentioned. A couple differences from what you mentioned, however:

    1) You can configure how many revisions you wish to skip at a time (along with the page delay, in seconds).

    2) You can change the start/end date (and even go in reverse).

    3) Rather than replacing a DIV, I simply replace the entire Body element with the new contents – it’s surprisingly fast. (Although, I may change this in the future)

    4) The problem is that you can’t do it with a bookmark (due to browser security issues), so it has to be done using either a server-side script (not portable) or a Greasemonkey script, like what I’m using (so far).

  6. Sounds great, can’t wait to see your work. In the meantime i’ll spruce up this history flow script since that seems wanted and add a Flickr pro account to the bounty while i’m at it.

  7. I put together a quick example of what I think you want last night. You can check out the ‘beta’ version at my blog.

    There are some problems that I describe in the above link, but I think it is pretty cool for just a few hours of hacking stuff together.

  8. I’m hoping that whatever tool/script/etc. gets developed, it will work on various wikis, not just wikipedia. I know at least one other heavily edited wiki that I’d love to see the animated history of.

  9. Corey: I tested out your script and there’s a couple problems. I tested it first and it crashed my browser twice – not entirely sure why, so I restarted and started Firefox with a new profile. It doesn’t crash now, but I get a Javascript error saying ‘rawRevisions has no properties’, Line 300. Another issue is that you’re loading all of the revisions every time you hit any Wikipedia entry, which is rather harsh (I have yet to get a list of yours to load completely). It should definitely be optional.

  10. dave bug: Excellent suggestion, I’ll see if I can’t work it into mine. Do you happen to know what Wiki software they’re using? Or what web site it is? This would help when it comes to testing.

  11. John: Thanks for looking at it. What version of Firefox/Greasemonkey are you using? I’d like to install those versions on my machine to see if it crashes on first install. I’ve tried it on Firefox 1.0.4/Greasemonkey 0.3.3 and it works fine for most pages.

    Also, there is a hide link on the revision element. If you click that, it will not load all the revisions each time (and it will remember its last state). This way you can use the diff tool only when you want to. Its default state is to be shown, since I figured most people would want to try it out after they download it.

  12. Corey: I think the reason why it crashed the first time was due to the fact that I had left my Wikipedia-mod script running, and they probably collided (somehow). I was running Firefox 1.0/Greasemonkey 0.3.3 – I just upgraded to 1.0.4 – and no luck getting it to run on either (this is on Windows). I can test it on my Mac when I get home tonight.

    Yeah, I noticed the ‘hide’ link after I posted that last comment, although, in my defense, blue-on-blue is kind of hard to read, hehe.

  13. I regret to affirm that WikiDiff crashed my browser, too. It worked absolutely fine on the first page where I tried it; then I went to another page, and instead of displaying the “loading” script, it displayed the page (my talk page) with no content, then after 10 seconds or so the browser crashed. I’m using Firefox 1.0.4 and Greasemonkey 0.3.3, and both have been stable for at least the last week.

    dave bug: There are literally hundreds of wiki implementations. The version that Wikipedia is built on, which they call MediaWiki, is specialized to meet the needs of an open encyclopedia. If the script is looking at the History page and screen scraping in some fashion, I wouldn’t expect that to work in other wikis. I certainly wouldn’t expect one person to write a generalized script which would encompass all possible History page formats. Just sayin’. I’d imagine that a working Mediawiki script could be fairly easily modified to work on another wiki, but that’s a different kettle of fish.

  14. On my setup (Firefox 1.0.4/Mac, GM 0.2.6) WikiDiff starts doing something (a blue window appears), but then it just kind of sits there. Maybe I should be more patient?

    I’ve taken a shot at implementing this, but my Greasemonkey script expects you to visit the history page first, and then does the animation from there.

    As for generalizing this sort of thing — I for one would love to see some kind of standardized API for pushing data into and getting meta-data out of different wiki engines.

  15. Dan: I really like yours, at first I was kind of miffed about having to go to the history page, but it makes sense in context. (And I like the addition of the Animate tab, heh) My qualms (at the moment) are that when the pages are loading it flashes something fierce (making it unviewable, until completely loaded) – and you can’t configure how long you wish to wait on a particular page (1s, 5s – it just jumps straight to the next page).

  16. Thanks for the feedback John — I updated my script to allow for different animation speeds and a mechanism for skipping minor edits.

    I’ll have to look into optimizing the way stuff gets displayed to reduce the flicker. Right now I’m just replacing element.innerHTML — maybe I should cache the next upcoming frame to let the browser render it first?
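One way to read the "cache the next upcoming frame" idea is a double buffer: write the upcoming revision into a hidden container, then swap which container is visible. The sketch below is an assumption about how that could be arranged, not how Wikipedia Animate actually works. `createFramePlayer` is a hypothetical name, and the two containers are assumed to be stacked on top of each other by CSS (for example, both absolutely positioned) so that swapping `visibility` does not shift the layout.

```javascript
// Hypothetical double buffer: two stacked containers, only one visible at a time.
// Each element only needs `innerHTML` and `style.visibility`.
function createFramePlayer(frontEl, backEl) {
  let visible = frontEl;
  let hidden = backEl;
  visible.style.visibility = "visible";
  hidden.style.visibility = "hidden";
  return {
    // Write the upcoming revision into the hidden container ahead of time.
    prepare(html) {
      hidden.innerHTML = html;
    },
    // Swap which container is shown; returns the newly visible element.
    flip() {
      hidden.style.visibility = "visible";
      visible.style.visibility = "hidden";
      [visible, hidden] = [hidden, visible];
      return visible;
    },
  };
}
```

A playback loop would call `prepare(nextHtml)` while the current frame is on screen and `flip()` when the delay expires. Whether this actually removes the flicker depends on the browser, so it is worth measuring before committing to it.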

  17. John, Dan: Luckily, the specific wiki I was thinking of is also using MediaWiki. It makes sense that this script wouldn’t inherently work with other wiki implementations (though an overall API would rock, and is a great idea), but hopefully it’d at least work with various MediaWiki wikis.

    Unluckily, the site I’m talking about is password protected. Plenty of other open ones to test it out on, though:

    http://meta.wikimedia.org/wiki/Sites_using_MediaWiki

  18. Well, just tried it out on the wiki I was talking about (after adding the url to the allowable ones), and it seems to work great. It was pretty flickery, but I just paused and let them all load. Excellent job, I look forward to seeing future changes.

  19. Pretty swanky, Dan. The draggable playbar is really fresh.

    I’m currently working on a similar approach, though it uses a few server-side scripts to work. This removes the need for Greasemonkey (great for Safari users), and it’s allowing me to add a few niceties like sparklines.

  20. Looking slick Dan… I agree with your idea of getting some API out of this. The hardest part was figuring out how to get the data.

    Anyway, I reworked my Greasemonkey script this afternoon. You can get the new one here. The crashing problem that was causing havoc should be fixed now.

  21. If there’s an enthusiastic coder out there, I would drop another hundred bucks for something like this:

    “Every Wikipedia page has some metadata-like attributes that are not shown, or not easily recognizable, but are stored in the database. These are the following:

    – was the content of the page discussed ever, by anyone,

    – when was the page started,

    – how many people contributed to the page,

    – how many edits were made to the page,

    – were there any major flame or vandalism regarding the content of the page,

    – how many other pages link here.

    I propose that if every Wikipedia page had a graphical representation of these data or their relation to each other, that would help the user form an immediate opinion of how much she/he should trust that page. (Font size, coloring, shading could be easily done even within HTML using CSS – no need for special graphic generation methods.)”

  22. I created a 42mb avi video (link to page to get it) of the entire Heavy metal umlaut history using Wikipedia Animate. I couldn’t find a video editor that didn’t trash the resolution so I just left it as is for now. If someone wanted to clean up the end that’d be swell. Here’s the simple AutoIt 3 code I used to move the mouse for me (no, i’m not that patient nor coordinated =)

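    ; Match window titles on any substring (mode 2)
    AutoItSetOption("WinTitleMatchMode", 2)
    ; Pause for 15 seconds before starting
    Sleep(15000)
    WinActivate("Heavy")
    If Not WinActive("Heavy") Then WinWaitActive("Heavy")
    If WinActive("Heavy") Then
        ; Drag from (225,147) to (619,148), then back again
        MouseClickDrag("left", 225, 147, 619, 148, 100)
        MouseClickDrag("left", 619, 148, 225, 147, 100)
    EndIf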

  23. WikipediaAnimate is really nice, and very useful already.

    Two things would help even more IMHO:

    *Accessor keys – For incremental stepping back/forward and pause/play through the article history. One should be careful to not conflict with the existing OS/browser/mediawiki accessor keys.

    *(This is a very difficult one) Some kind of thumbnailed article view (at the top), where the changes are highlighted in color. This would help with understanding the changes in longer articles. Currently one has to scroll to see whether there are changes further down.

  24. Both these scripts sound great. Unfortunately I can’t get either Dan’s Wikipedia Animate or Corey’s WikiDiff to work for me as yet though. The former displays the button on the history page but clicking it has no effect and the latter provides the green scrollbar on an article but moving the slider doesn’t do anything. I commented over on Corey’s blog entry and I’ll perhaps try to poke Dan. I was able to see the movie & screenshots of the tools in use though and they do seem good!

  25. I was thinking about the “changes below the fold” problem, too. This might be too elaborate, but would it be possible to apply (optional) line numbers to the whole page, and then list (up top, with the rest of the info) on what lines the changes occurred?
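The line-number idea could be prototyped with an ordinary line-based diff. The sketch below is one possible approach, assuming the two revisions are available as plain text: it computes a longest common subsequence of lines and reports the 1-based numbers of lines in the newer revision that are not part of it. `changedLineNumbers` is a hypothetical name, and lines that were only removed are not reported, since they have no line number in the newer revision.

```javascript
// Return the 1-based line numbers in newText that were added or modified
// relative to oldText, using a longest-common-subsequence (LCS) of lines.
function changedLineNumbers(oldText, newText) {
  const a = oldText.split("\n");
  const b = newText.split("\n");
  // lcs[i][j] = length of the LCS of a[i..] and b[j..]
  const lcs = Array.from({ length: a.length + 1 }, () =>
    new Array(b.length + 1).fill(0)
  );
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i][j] =
        a[i] === b[j]
          ? lcs[i + 1][j + 1] + 1
          : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }
  // Walk both revisions, collecting lines that exist only in the new one.
  const changed = [];
  let i = 0;
  let j = 0;
  while (j < b.length) {
    if (i < a.length && a[i] === b[j]) {
      i++;
      j++;
    } else if (i < a.length && lcs[i + 1][j] >= lcs[i][j + 1]) {
      i++; // line dropped from the old revision
    } else {
      changed.push(j + 1); // line present only in the new revision
      j++;
    }
  }
  return changed;
}
```

The table is quadratic in the number of lines, which is probably fine for a single article but would add up when animating hundreds of revisions.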

  26. Alright – I’ve just finished my entry into the contest (for now – there’s still a number of bugs to squish). A summary of the features:

    That’s everything that I can think of, off the top of my head. I’m going to try and squish the last of the bugs tomorrow, but it’s mostly usable now.

    AniWiki: More Info / Download

  27. Wish it weren’t so, but I don’t have any more versions up my sleeve (at least not for the Tuesday deadline).

    Yeah, the off-site calls for diffs are certainly a security risk. I should stick a line of code in there to strip any SCRIPT elements out of the returned diff text. Or … um, use JS-based diffs.

    John Resig: Where’d you get that algorithm? Pretty slick.

  28. BetterHistory

    Okay, so I’ve managed to finish my entry, too. Instead of writing (yet another) Greasemonkey script, I wrote mine as Javascript that everyone with a modern browser can use. I’ve tested it in Opera, Firefox, and IE 6 so far.

    I’m calling it BetterHistory, since that name doesn’t seem to be taken, but I haven’t tried the others, so I can’t say just how much better mine really is. ; )

    Every Wikipedia user has a user subpage of Javascript that gets included on every page. You can edit that page at:

    http://en.wikipedia.org/wiki/User:Your-Username-Here/monobook.js. All you need to do is add the line:

    document.write('<script src="http://gladstone.uoregon.edu/~chill1/betterhistory/betterhistory.js"></script>');

    to that page to start using BetterHistory. I would *prefer* everyone cut and paste this file instead (so I can move things around without breaking it in the future), but if you want automatic upgrades, just link to mine.

    I also made this video demonstrating how cool this whole idea is. If anyone has any great ideas for features, be sure to email them to me.

  29. John,

    any ideas on the following? that’s original continuation

    “If we continue this idea in a way that considers the user an active contributor, we should show her/him how “close” or “far” a given article is to her/him. We should interpret “closeness” based on an Erdos number-like model. Closeness means trustability. And here we also come back to the geographically-sensitive sticker board that really works when you know you can trust or mistrust information based on “closeness”.”

  30. soobrosa@f!lter I’m not entirely sure if I follow your last train of thought – are you looking for some sort of clustering metric for Wikipedia contributors? It’s definitely an interesting concept, nonetheless.

  31. john definitely yes.

    but till then any smarties?

    like “age:” instead of “created:”

    can differential calculus on editing frequency help us determine how many flames an article has had?

    any typical waves? like recently increased editing frequency means article is “getting hot” or steadily big editing frequency means article is currently “hot”

    how many “parents” (recurring editors) does an article have?

    don’t take me too seriously i’m just tinkering on making it more human
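The "getting hot" versus "hot" distinction could be approximated by bucketing edit timestamps into fixed windows and comparing the latest window against the earlier average. The sketch below is only an illustration of that idea: the function names, the labels, the default threshold of 10 edits per window, and the "more than double the baseline" rule are all assumptions picked for the example, not anything measured on Wikipedia.

```javascript
// Count edits per fixed-size time window.
// timestamps: edit times in milliseconds; windowMs: window length in milliseconds.
function editsPerWindow(timestamps, windowMs) {
  if (timestamps.length === 0) return [];
  const sorted = [...timestamps].sort((x, y) => x - y);
  const start = sorted[0];
  const windowCount =
    Math.floor((sorted[sorted.length - 1] - start) / windowMs) + 1;
  const counts = new Array(windowCount).fill(0);
  for (const t of sorted) {
    counts[Math.floor((t - start) / windowMs)]++;
  }
  return counts;
}

// Hypothetical classification; the threshold and labels are illustrative only.
function classifyActivity(counts, hotThreshold = 10) {
  if (counts.length < 2) return "unknown";
  const latest = counts[counts.length - 1];
  const earlier = counts.slice(0, -1);
  const baseline = earlier.reduce((sum, c) => sum + c, 0) / earlier.length;
  // Steadily high editing frequency.
  if (baseline >= hotThreshold && latest >= hotThreshold) return "hot";
  // Recent jump well above the historical average.
  if (latest > 2 * baseline && latest >= hotThreshold) return "getting hot";
  return "quiet";
}
```

For example, weekly counts of `[1, 2, 1, 12]` would be labelled "getting hot" under these assumed thresholds, while `[12, 15, 11, 14]` would be labelled "hot".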

  32. The person who Andy selects need not worry about meeting my criteria; they get $50 no matter what.

    While I have my own sick ideas about what would be the best UI to deal with tracking the energy-wasters on Wikipedia, I am pleased that Andy has jump-started some appropriate work and research into making it easier to track information changes in the service. As I mentioned in two articles I wrote on my weblog, my concern with Wikipedia is that its “anyone can do anything” trust model heals itself to some extent, but at the cost of countless wasted hours of human effort. I have watched many people fighting others with absolutely no knowledge or information on a subject, applying idiotic procedural rules and endless arguments that drive knowledgeable people away, leaving policy wonks.

    By starting to make UI clients that interact with Wikipedia and improve the ability to track what THIS guy is doing and WHY and see what information gets beat down and what’s added, people will not waste energy as much. At least, we can hope.

    So where the $150 goes, my $50 goes.

  33. I really like the idea of contests and prize buckets. So count me in for $50, to be sent via PayPal to the winner (or to Andy, if that’s how we are supposed to do this).

  34. any ideas on the following? that’s original continuation

    “If we continue this idea in a way that considering the user is an active contributor, we should show her/him, how “close” or “far” is a given article to her/him. We should interpret “closeness” based on an Erdos number-like model. Closeness means trustability. And here we also come back to the geographically-sensitive sticker board that really works when you know can trust or mistrust information based on “cl

  35. Comparing information about literature between a general encyclopedia like Wikipedia and a literature database like Gale Literature Resource Center is comparing apples and oranges and is unfair to Wikipedia.

Comments are closed.