The Forbidden Web

Unethical idea of the day: ‘The Forbidden Web,’ a search engine that only indexes files disallowed by robots.txt files. For example, CNN’s robots.txt file asks search engines to avoid their transcripts, jobs, website statistics, and development directories. The Forbidden Web would index only those forbidden (and often intriguing) directories. Evil, isn’t it?
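The core of the idea is trivial to prototype: robots.txt is just a plain-text list of `Disallow:` directives, so inverting it means collecting those paths instead of honoring them. A minimal sketch (the example robots.txt body below is hypothetical, loosely modeled on the CNN example):

```python
def forbidden_paths(robots_txt: str) -> list[str]:
    """Return the Disallow paths listed in a robots.txt body --
    i.e., exactly the directories The Forbidden Web would index."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            if path:  # a bare "Disallow:" means "allow everything"
                paths.append(path)
    return paths

example = """
User-agent: *
Disallow: /transcripts/
Disallow: /jobs/
Disallow: /stats/   # website statistics
Disallow:
"""
print(forbidden_paths(example))  # ['/transcripts/', '/jobs/', '/stats/']
```

A real crawler would fetch each site's `/robots.txt` and group directives per `User-agent`, but the inversion itself is no more complicated than this.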

A glance at the robots.txt files on some popular sites: New York Times, Google, Hotwired, eBay, Slashdot, Verisuck, Kuro5hin, Filepile, ZDNet, Epinions, IMDB, BBC, IBM, USA Today, Jakob Nielsen.

You can search Google for more robots.txt files.


    I wondered about the same thing back when I had to modify my robots.txt file to keep the Wayback Machine from downloading my entire site *6 times a day* (post-9/11). I bet there are quite a few bots out there that specifically look for items forbidden by the robots.txt file. To what end, I don’t know.

    My web server had a tough time once, when the Danish Royal Library tried to archive everything on my site – including that script that turned the *entire* web into something much like Pig Latin. The Royal Library has a specific mandate to archive everything written in Danish, so they didn’t feel a need to respect robots.txt files for sites in the .dk top-level domain. They kindly backed off when I told them they were getting copies of the entire world.

Comments are closed.