CMU's ClueWeb09: a crawl of 1 billion web pages, available to researchers

Massive 25-terabyte dataset (uncompressed), shipped compressed on four 1.5-terabyte drives. Someone get this up on AWS!