Automating Wikipedia History

I’d like to suggest you use fundable.org for this.

I considered it, but Fundable doesn’t allow you to create a group action without a target amount or set contribution. Ideally, I’d want to allow people to just continue to kick in arbitrary amounts of money until someone decides to claim the bounty. If nobody else contributes, the bounty is still seeded with my $50.

Is there a deadline?

I’d like to throw $20 in the pool.

How about next Tuesday?

I’ll contribute a free Flickr Pro account.

Hey there, Andy.

You know my opinion on Wikipedia. A lot of it comes from my research, and I am all for a tool like this coming out, ESPECIALLY if it allows everyone to totally break out the tracking of Wikipedia work, and be able to tell when information is pulled away. Basically, a Wikipedia client that works to Wikipedia like a graphical IRC client works with IRC. I offer $50 to this goal.

One avenue is to contact the history flow guys at IBM to find out if there is a more granular version. History flow seems to do what you want at a more macro time scale. Being able to navigate that flow seems like a natural add on.

Although I definitely want a readable snapshot of the webpage itself, rather than a charted visualization. See Jon Udell’s screencast for a better example of what I mean.

i’ll kick in ten bucks for this cause. this sounds like an awesome project. e-mail me for some paypal action.

Some ideas for you hackers:

http://www.alphaworks.ibm.com/tech/historyflow

I would like to add another $20 to the pool.

Cool idea. I can chip in $50 (although I will be at a conference most of next week).

Doing this is fairly straightforward with XMLHttpRequest. You create a JavaScript bookmarklet, and when you visit the article and click on it, it downloads the article history (simply http://en.wikipedia.org/w/index.php?title=Article&action=history), using a regular expression to filter out URLs to the last 500 edits and storing them in an array. You then concatenate the regex matches into full URLs and use a for loop based on the length of the array to download a buffer of old articles (probably skipping ten at a time) and save them to another array. After enough have been gathered (akin to when your media player buffers so it doesn’t lag), it erases the page and replaces it with the HTML it garnered in the first XMLHttpRequest, and after 1 second it just ups the array index, and so on, while simultaneously downloading the rest.
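For anyone who wants to try the scraping step described above, here is a minimal sketch, not anyone's actual entry. It assumes the history page links carry an `oldid=NNN` parameter; the sample markup and the helper names `extractRevisionIds` and `buildRevisionUrls` are made up for illustration, and a real history page may look different.

```javascript
// Hypothetical helpers; the sample markup below is invented for illustration.
const BASE = "http://en.wikipedia.org/w/index.php";

// Collect the unique oldid values found in the history HTML, in page order.
function extractRevisionIds(historyHtml) {
  const ids = [];
  const seen = new Set();
  const pattern = /\boldid=(\d+)/g;
  let match;
  while ((match = pattern.exec(historyHtml)) !== null) {
    if (!seen.has(match[1])) {
      seen.add(match[1]);
      ids.push(match[1]);
    }
  }
  return ids;
}

// Turn every `step`-th revision id into a full URL for that old revision.
function buildRevisionUrls(title, ids, step) {
  const urls = [];
  for (let i = 0; i < ids.length; i += step) {
    urls.push(BASE + "?title=" + encodeURIComponent(title) + "&oldid=" + ids[i]);
  }
  return urls;
}

const sampleHistory =
  '<a href="/w/index.php?title=Article&amp;oldid=300">cur</a>' +
  '<a href="/w/index.php?title=Article&amp;oldid=200">last</a>' +
  '<a href="/w/index.php?title=Article&amp;diff=prev&amp;oldid=200">diff</a>' +
  '<a href="/w/index.php?title=Article&amp;oldid=100">first</a>';

// The sample lists the newest revision first, so reverse to play oldest-first.
const revisionIds = extractRevisionIds(sampleHistory).reverse();
const revisionUrls = buildRevisionUrls("Article", revisionIds, 2);
console.log(revisionUrls);
```

The actual downloading and the one-second page swap are left out here, since they depend on where the script runs (bookmarklet, Greasemonkey, or server-side).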

I don’t have time to implement this, but if I did I would have an absolutely positioned div across the top, akin in size to the Blogger navbar, and it would contain some basic options, such as the number of articles to buffer before beginning the show, etc. For the articles that I downloaded and saved in the buffer, I would extract only the contents of div id=”bodyContent”. Below my navbar, the entire rest of the document would become this div.

This is actually quite a lot of hard work. I have lying around here somewhere a python script which downloads the article histories and scrapes them into a format that HistoryFlow recognizes (since the current version seems to have no automatic way of doing this), and that is suitable enough for me, personally.

Brian: I’ve actually already finished everything you mentioned. A couple differences from what you mentioned, however:

1) You can configure how many revisions you wish to skip at a time (along with the page delay, in seconds).

2) You can change the start/end date (and even go in reverse).

3) Rather than replacing a DIV, I simply replace the entire Body element with the new contents – it’s surprisingly fast. (Although I may change this in the future.)

4) The problem is that you can’t do it with a bookmark (due to browser security issues), so it has to be done using either a server-side script (not portable) or a Greasemonkey script, like what I’m using (so far).

Sounds great, can’t wait to see your work. In the meantime i’ll spruce up this history flow script since that seems wanted and add a Flickr pro account to the bounty while i’m at it.

I put together a quick example of what I think you want last night. You can check out the ‘beta’ version at my blog

There are some problems that I describe in the above link, but I think it is pretty cool for just a few hours of hacking stuff together.

I’m hoping that whatever tool/script/etc. gets developed, it will work on various wikis, not just wikipedia. I know at least one other heavily edited wiki that I’d love to see the animated history of.

Corey: I tested out your script and there are a couple of problems. I tested it first and it crashed my browser twice – not entirely sure why, so I restarted and started Firefox with a new profile. It doesn’t crash now, but I get a Javascript error saying ‘rawRevisions has no properties’, Line 300. Another issue is that you’re loading all of the revisions every time you hit any Wikipedia entry, which is rather harsh (I have yet to get a list of yours to load completely). It should definitely be optional.

dave bug: Excellent suggestion, I’ll see if I can’t work it into mine. Do you happen to know what Wiki software they’re using? Or what web site it is? This would help when it comes to testing.

John: Thanks for looking at it. What version of Firefox/Greasemonkey are you using? I’d like to install those versions on my machine to see if it crashes on first install. I’ve tried it on Firefox 1.0.4/Greasemonkey 0.3.3 and it works fine for most pages.

Also, there is a hide link on the revision element. If you click that, it will not load all the revisions each time (and it will remember its last state). This way you can use the diff tool only when you want to. Its default state is to be shown, since I figured most people would want to try it out after they download it.

Corey: I think the reason why it crashed the first time was due to the fact that I had left my Wikipedia-mod script running, and they probably collided (somehow). I was running Firefox 1.0/Greasemonkey 0.3.3 – I just upgraded to 1.0.4 – and no luck getting it to run on either (this is on Windows). I can test it on my Mac when I get home tonight.

Yeah, I noticed the ‘hide’ link after I posted that last comment, although, in my defense, blue-on-blue is kind of hard to read, hehe.

I regret to affirm that WikiDiff crashed my browser, too. It worked absolutely fine on the first page where I tried it; then I went to another page, and instead of displaying the “loading” script, it displayed the page (my talk page) with no content, then after 10 seconds or so the browser crashed. I’m using Firefox 1.0.4 and Greasemonkey 0.3.3, and both have been stable for at least the last week.

dave bug: There are literally hundreds of wiki implementations. The version that Wikipedia is built on, which they call MediaWiki, is specialized to meet the needs of an open encyclopedia. If the script is looking at the History page and screen scraping in some fashion, I wouldn’t expect that to work in other wikis. I certainly wouldn’t expect one person to write a generalized script which would encompass all possible History page formats. Just sayin’. I’d imagine that a working MediaWiki script could be fairly easily modified to work on another wiki, but that’s a different kettle of fish.

On my setup (Firefox 1.0.4/Mac, GM 0.2.6) WikiDiff starts doing something (a blue window appears), but then it just kind of sits there. Maybe I should be more patient?

I’ve taken a shot at implementing this, but my Greasemonkey script expects you to visit the history page first, and then does the animation from there.

As for generalizing this sort of thing — I for one would love to see some kind of standardized API for pushing data into and getting meta-data out of different wiki engines.

Dan: I really like yours, at first I was kind of miffed about having to go to the history page, but it makes sense in context. (And I like the addition of the Animate tab, heh) My qualms (at the moment) are that when the pages are loading it flashes something fierce (making it unviewable, until completely loaded) – and you can’t configure how long you wish to wait on a particular page (1s, 5s – it just jumps straight to the next page).

Thanks for the feedback John — I updated my script to allow for different animation speeds and a mechanism for skipping minor edits.

I’ll have to look into optimizing the way stuff gets displayed to reduce the flicker. Right now I’m just replacing element.innerHTML — maybe I should cache the next upcoming frame to let the browser render it first?

John, Dan: Luckily, the specific wiki I was thinking of is also using MediaWiki. It makes sense that this script wouldn’t inherently work with other wiki implementations (though an overall API would rock, and is a great idea), but hopefully it’d at least work with various MediaWiki wikis.

Unluckily, the site I’m talking about is password protected. Plenty of other open ones to test it out on, though:

http://meta.wikimedia.org/wiki/Sites_using_MediaWiki

Well, just tried it out on the wiki I was talking about (after adding the url to the allowable ones), and it seems to work great. It was pretty flickery, but I just paused and let them all load. Excellent job, I look forward to seeing future changes.

Pretty swanky, Dan. The draggable playbar is really fresh.

I’m currently working on a similar approach, though it uses a few server-side scripts to work. This removes the need for Greasemonkey (great for Safari users), and it’s allowing me to add a few niceties like sparklines.

Looking slick Dan… I agree with your idea of getting some API out of this. The hardest part was figuring out how to get the data.

Anyway, I reworked my Greasemonkey script this afternoon. You can get the new one here. The crashing problem that was causing havoc should be fixed now.

I’ve compiled WikipediaAnimate into a regular Firefox extension for those who don’t have Greasemonkey. XPI File

DAN!

That diff option is amazing! I’m still trying to figure out how you did that.

Yeah, it definitely helps visualize the changes. I can’t take too much credit though — most of the heavy lifting is done by Aaron Swartz’s HTML Diff script.

I’d love to see something that works outside of Firefox please. I love the browser and all, but I’m on Safari 80% of the time nonetheless.

I’ve been testing my script on PithHelmet and I think with a little more tweaking I’ll be able to support Safari users.

I’ll contribute a Socialtext Starter package (5 users, 1 year, $495 value)

Dan: Can you update your script to show ‘>’ buttons?

I meant ‘<<‘ and ‘>>’ buttons

If there’s an enthusiastic coder out there, I would drop another hundred bucks for something like this:

“Every Wikipedia page has some metadata-like attributes that are not shown, or not easily recognizable, but are stored in the database. These are the following:

– was the content of the page discussed ever, by anyone,

– when was the page started,

– how many people contributed to the page,

– how many edits were made to the page,

– were there any major flame or vandalism regarding the content of the page,

– how many other pages link here.

I propose that if every Wikipedia page had a graphical representation of these data or their relation to each other, that would help the user form an immediate opinion of how much she/he should trust that page. (Font size, coloring, shading could be easily done even within HTML using CSS – no need for special graphic generation methods.)”

I created a 42mb avi video (link to page to get it) of the entire Heavy metal umlaut history using Wikipedia Animate. I couldn’t find a video editor that didn’t trash the resolution so I just left it as is for now. If someone wanted to clean up the end that’d be swell. Here’s the simple AutoIt 3 code I used to move the mouse for me (no, i’m not that patient nor coordinated =)

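; Match window titles by substring
AutoItSetOption("WinTitleMatchMode", 2)
; Pause 15 seconds before starting
Sleep(15000)
WinActivate("Heavy")
If Not WinActive("Heavy") Then WinWaitActive("Heavy")
If WinActive("Heavy") Then
    ; Drag from (225,147) to (619,148) and back; speed 100 is the slowest setting
    MouseClickDrag("left", 225, 147, 619, 148, 100)
    MouseClickDrag("left", 619, 148, 225, 147, 100)
EndIf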

WikipediaAnimate is really nice, and very useful already.

Two things would help even more IMHO:

*Accessor keys – For incremental stepping back/forward and pause/play through the article history. One should be careful to not conflict with the existing OS/browser/mediawiki accessor keys.

*(This is a very difficult one) Some kind of thumbnailed article view (at the top), where the changes are highlighted in color. This would help with understanding the changes in longer articles. Currently one has to scroll to see whether there are changes further down.

Both these scripts sound great. Unfortunately I can’t get either Dan’s Wikipedia Animate or Corey’s WikiDiff to work for me as yet though. The former displays the button on the history page but clicking it has no effect and the latter provides the green scrollbar on an article but moving the slider doesn’t do anything. I commented over on Corey’s blog entry and I’ll perhaps try to poke Dan. I was able to see the movie & screenshots of the tools in use though and they do seem good!

I was thinking about the “changes below the fold” problem, too. This might be too elaborate, but would it be possible to apply (optional) line numbers to the whole page, and then list (up top, with the rest of the info) on what lines the changes occurred?

Alright – I’ve just finished my entry into the contest (for now – there’s still a number of bugs to squish). A summary of the features:

Advanced Wiki Entry Info (for soobrosa@f!lter)
Graphical Timeline and Sliders (for Michal Migurski and Andy)
Automate in Reverse, view the slideshow going forwards, or backwards, in time.
In-Page Diff(s) – Implemented completely in Javascript, much faster, and more secure, than Dan’s implementation (no offense). (for Andy, et al.)
Below (and Above!) the fold diff notification (for Jama and dave)
Configure slideshow duration – and skip every N entries
Choose Specific Entries (similar to Dan’s entry)
Keyboard navigation (for Jama)
Estimated Slideshow duration
No limit on the number of revisions to show

That’s everything that I can think of, off the top of my head. I’m going to try and squish the last of the bugs tomorrow, but it’s mostly usable now.

AniWiki: More Info / Download

I spent some time tonight fixing some bugs in wikiDiff. You can read more about it here

The big changes are

No more flashing between frames.
Previous/Next buttons
Access Keys ALT-P(revious) ALT-N(ext) ALT-F(irst) ALT-L(ast)
Will animate forward and in reverse

Wish it weren’t so, but I don’t have any more versions up my sleeve (at least not for the Tuesday deadline).

Yeah, the off-site calls for diffs are certainly a security risk. I should stick a line of code in there to strip any SCRIPT elements out of the returned diff text. Or … um, use JS-based diffs.

John Resig: Where’d you get that algorithm? Pretty slick.

BetterHistory

Okay, so I’ve managed to finish my entry, too. Instead of writing (yet another) Greasemonkey script, I wrote mine as Javascript that everyone with a modern browser can use. I’ve tested it in Opera, Firefox, and IE 6 so far.

I’m calling it BetterHistory, since that name doesn’t seem to be taken, but I haven’t tried the others, so I can’t say just how much better mine really is. ; )

Every Wikipedia user has a user subpage of Javascript that gets included on every page. You can edit that page at:

http://en.wikipedia.org/wiki/User:Your-Username-Here/monobook.js. All you need to do is add the line:

document.write('<script src="http://gladstone.uoregon.edu/~chill1/betterhistory/betterhistory.js"></script>');

to that page to start using BetterHistory. I would *prefer* everyone cut and paste this file instead (so I can move things around without breaking it in the future), but if you want automatic upgrades, just link to mine.

I also made this video demonstrating how cool this whole idea is. If anyone has any great ideas for features, be sure to email them to me.

I also made a page here with better installation instructions, and more information.

Dan: I wrote it myself. I read a paper called “A technique for isolating differences between files” and implemented it from there. I’m going to be releasing the code that I wrote soon, so everyone will be able to use it.
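To give a feel for what an in-page, all-Javascript diff involves, here is a rough sketch. It is not the algorithm from the paper mentioned above and not the code from any entry in this thread; it swaps in a plain longest-common-subsequence word diff because that is easier to show in a few lines. The function names and the `[-…-]` / `{+…+}` markers are made up for illustration.

```javascript
// Word-level diff via a longest-common-subsequence (LCS) table.
// A stand-in for illustration, not the algorithm from the paper.
function diffWords(oldText, newText) {
  const a = oldText.split(/\s+/).filter(Boolean);
  const b = newText.split(/\s+/).filter(Boolean);

  // lcs[i][j] = length of the LCS of a[i..] and b[j..]
  const lcs = [];
  for (let i = 0; i <= a.length; i++) {
    lcs.push(new Array(b.length + 1).fill(0));
  }
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i][j] = a[i] === b[j]
        ? lcs[i + 1][j + 1] + 1
        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }

  // Walk the table, emitting kept, deleted, and inserted words in order.
  const ops = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      ops.push({ op: "keep", word: a[i] });
      i++;
      j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      ops.push({ op: "del", word: a[i] });
      i++;
    } else {
      ops.push({ op: "ins", word: b[j] });
      j++;
    }
  }
  while (i < a.length) ops.push({ op: "del", word: a[i++] });
  while (j < b.length) ops.push({ op: "ins", word: b[j++] });
  return ops;
}

// Render deletions as [-word-] and insertions as {+word+}.
function renderDiff(ops) {
  return ops
    .map((o) =>
      o.op === "keep" ? o.word
        : o.op === "del" ? "[-" + o.word + "-]"
        : "{+" + o.word + "+}")
    .join(" ");
}

const rendered = renderDiff(
  diffWords("the quick brown fox", "the slow brown fox jumps"));
console.log(rendered); // the [-quick-] {+slow+} brown fox {+jumps+}
```

The table costs time and memory proportional to the product of the two word counts, so for long articles a line-level pass first, or a smarter algorithm, would probably be needed.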

John,

any ideas on the following? that’s original continuation

“If we continue this idea in a way that considering the user is an active contributor, we should show her/him, how “close” or “far” is a given article to her/him. We should interpret “closeness” based on an Erdos number-like model. Closeness means trustability. And here we also come back to the geographically-sensitive sticker board that really works when you know you can trust or mistrust information based on “closeness”.”

soobrosa@f!lter I’m not entirely sure if I follow your last train of thought – are you looking for some sort of clustering metric for Wikipedia contributors? It’s definitely an interesting concept, nonetheless.

john definitely yes.

but till then any smarties?

like “age:” instead of “created:”

can differential calculus on editing frequency help us determine how many flames an article has had?

any typical waves? like recently increased editing frequency means article is “getting hot” or steadily big editing frequency means article is currently “hot”
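One way to tinker with the editing-frequency idea is sketched below, purely as an illustration. It buckets edit timestamps into days and takes the day-to-day difference as a crude "derivative". The sample timestamps are invented, the helper names are hypothetical, and the "getting hot" threshold of 3 is an arbitrary assumption, not something measured on real articles.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Count edits per UTC day, from the day of the first edit to the day of the last.
function editsPerDay(timestamps) {
  const days = timestamps
    .map((t) => Math.floor(Date.parse(t) / DAY_MS))
    .sort((x, y) => x - y);
  if (days.length === 0) return [];
  const counts = new Array(days[days.length - 1] - days[0] + 1).fill(0);
  for (const d of days) counts[d - days[0]]++;
  return counts;
}

// Discrete "derivative": change in edit count from one day to the next.
function dailyChange(counts) {
  const deltas = [];
  for (let i = 1; i < counts.length; i++) {
    deltas.push(counts[i] - counts[i - 1]);
  }
  return deltas;
}

// Call an article "getting hot" if the latest day-to-day jump reaches the threshold.
function isGettingHot(counts, threshold) {
  const deltas = dailyChange(counts);
  return deltas.length > 0 && deltas[deltas.length - 1] >= threshold;
}

// Invented sample data: a quiet start, then a burst of edits on the last day.
const sampleEdits = [
  "2005-06-01T10:00:00Z",
  "2005-06-02T09:00:00Z",
  "2005-06-02T21:30:00Z",
  "2005-06-04T08:00:00Z",
  "2005-06-04T08:05:00Z",
  "2005-06-04T08:20:00Z",
  "2005-06-04T11:00:00Z",
];

const counts = editsPerDay(sampleEdits);
const deltas = dailyChange(counts);
console.log(counts, deltas, isGettingHot(counts, 3));
```

A steadily high count with small deltas would correspond to the "currently hot" case, while a large positive delta at the end corresponds to "getting hot".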

how many “parents” (recurring editors) does an article have?

don’t take me too serious i’m just tinkering on making it more human

The person who Andy selects need not worry about meeting my criteria; they get $50 no matter what.

While I have my own sick ideas about what would be the best UI to deal with tracking the energy-wasters on Wikipedia, I am pleased that Andy has jump-started some appropriate work and research into making it easier to track information changes in the service. As I mentioned in two articles I wrote on my weblog, my concern with Wikipedia is that its “anyone can do anything” trust model heals itself to some extent, but at the cost of countless wasted hours of human effort. I have watched many people fighting others with absolutely no knowledge or information on a subject, applying idiotic procedural rules and endless arguments that drive knowledgeable people away, leaving policy wonks.

By starting to make UI clients that interact with Wikipedia and improve the ability to track what THIS guy is doing and WHY and see what information gets beat down and what’s added, people will not waste energy as much. At least, we can hope.

So where the $150 goes, my $50 goes.

So… Who won?

I was waiting to hear back from the prize volunteers. I’ll announce a winner Monday morning.

I really like the idea of contests and prize buckets. So count me in for $50, to be sent via PayPal to the winner (or to Andy, if that’s how we are supposed to do this).

Wikipedia is currently locked since they’re updating to 1.5. Any chance/risk this will break any of the above entries?

http://meta.wikimedia.org/wiki/MediaWiki_1.5_upgrade

I am one of the authors of History Flow.

There’s no need to screen-scrape wikipedia histories; you can grab the whole history of an article in one blow, and reduce the load on the Mediawiki renderfarm, if you use the URL http://en.wikipedia.org/wiki/Special:Export in a sufficiently clever manner.
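As a rough illustration of the Special:Export route, the sketch below builds an export URL and pulls revisions out of the returned XML. The query parameters (`pages`, `history`, `action=submit`) are my assumption of how Special:Export can be asked for full history; a given wiki may restrict this or require a form POST instead. The sample XML is invented and heavily trimmed, and the helper names are hypothetical.

```javascript
// Build a Special:Export URL for one page (the parameter names are assumptions).
function buildExportUrl(title) {
  const query = new URLSearchParams({
    title: "Special:Export",
    pages: title,
    history: "1",
    action: "submit",
  });
  return "http://en.wikipedia.org/w/index.php?" + query.toString();
}

// Pull id, timestamp, and text out of each <revision> element.
// Regex parsing is only good enough for a sketch; the text stays XML-escaped.
function parseRevisions(xml) {
  const revisions = [];
  const revisionPattern = /<revision>([\s\S]*?)<\/revision>/g;
  let match;
  while ((match = revisionPattern.exec(xml)) !== null) {
    const body = match[1];
    const id = /<id>(\d+)<\/id>/.exec(body);
    const timestamp = /<timestamp>([^<]+)<\/timestamp>/.exec(body);
    const text = /<text[^>]*>([\s\S]*?)<\/text>/.exec(body);
    revisions.push({
      id: id ? id[1] : null,
      timestamp: timestamp ? timestamp[1] : null,
      text: text ? text[1] : "",
    });
  }
  return revisions;
}

// Invented, trimmed sample of what an export might contain.
const sampleExport =
  "<mediawiki><page><title>Article</title>" +
  "<revision><id>100</id><timestamp>2005-06-01T10:00:00Z</timestamp>" +
  "<text>First draft</text></revision>" +
  "<revision><id>200</id><timestamp>2005-06-02T09:00:00Z</timestamp>" +
  "<text>Second draft</text></revision>" +
  "</page></mediawiki>";

const exportUrl = buildExportUrl("Heavy metal umlaut");
const exportedRevisions = parseRevisions(sampleExport);
console.log(exportUrl);
console.log(exportedRevisions.map((r) => r.timestamp));
```

One request for the whole history replaces one request per revision, which is the load reduction being suggested; in a browser, a real XML parser would be a better fit than the regexes used here.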

Comparing information about literature between a general encyclopedia like Wikipedia and a literature database like Gale Literature Resource Center is comparing apples and oranges and is unfair to Wikipedia.
