Unethical idea of the day: ‘The Forbidden Web,’ a search engine that only indexes files disallowed by robots.txt files. For example, CNN’s robots.txt file asks search engines to avoid their transcripts, jobs, website statistics, and development directories. The Forbidden Web would index only those forbidden (and often intriguing) directories. Evil, isn’t it?
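For the curious, here's a minimal sketch of how such an engine might find its crawl targets: pull the Disallow lines out of a robots.txt body. The sample robots.txt text below is hypothetical, loosely modeled on the CNN example above.

```python
# Sketch: extract the paths a robots.txt file asks crawlers to avoid.
# The sample content is made up for illustration.

def disallowed_paths(robots_txt: str) -> list[str]:
    """Return the Disallow paths listed in a robots.txt body."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            if path:  # an empty Disallow means "everything is allowed"
                paths.append(path)
    return paths

sample = """\
User-agent: *
Disallow: /transcripts/
Disallow: /jobs/
Disallow: /stats/   # website statistics
Disallow:
"""

print(disallowed_paths(sample))  # ['/transcripts/', '/jobs/', '/stats/']
```

A real Forbidden Web crawler would then fetch exactly those directories and nothing else, inverting the usual polite-crawler behavior.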
A glance at the robots.txt files on some popular sites: New York Times, Google, Hotwired, eBay, Slashdot, Verisuck, Kuro5hin, Filepile, ZDNet, Epinions, IMDb, BBC, IBM, USA Today, Jakob Nielsen.
You can search Google for more robots.txt files.