You learn an awful lot when you run your own site, and it’s mostly driven by what people try to do to you. This week’s lesson was about ‘dark crawlers,’ web spiders that don’t play by the rules.
Any program that crawls the web for information (mostly search engines) is supposed to follow the robots exclusion standard, published in a site's robots.txt file. Not every part of a web site is suitable for crawling — for example, the /cgi-bin/mt-comments.cgi… links on my site would only contain information that's already in the main articles and would place extra load on the server — so you use that file to tell crawlers which parts of your site not to bother loading. You can also use it to shut out crawlers that are behaving in ways you don't like (e.g., polling too often), keyed on a unique part of their user agent name.
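For reference, a minimal robots.txt that does both of those things might look something like the following; 'SomeAnnoyingBot' is just a made-up name standing in for whatever unique string appears in the offending crawler's user agent.

    # Shut out one badly behaved crawler entirely, matched on its user agent name
    User-agent: SomeAnnoyingBot
    Disallow: /

    # Everyone else: please stay out of the comment-posting CGI
    User-agent: *
    Disallow: /cgi-bin/mt-comments.cgi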
Adherence to that standard is entirely voluntary, though, and some crawlers simply ignore it. Yesterday I noticed a large number of hits coming from two class-C subnets. Although their user agents made them look like regular web browsers (one Mac, one Windows), they were very rapidly working their way through every article on the site, in numerical order. A Google check on the IP address ranges quickly revealed that they belonged to a company called ‘Web Content International’, which is apparently notorious for this kind of dark crawling.
A couple of new firewall rules to drop packets from them stopped the crawling, but then my kernel logs were flooded with firewall intrusion notices: they were apparently content to sit there retrying every few seconds for however long it took to get back in. Letting their packets through again and having Apache return a 403 Forbidden page to requests from their address ranges instead finally seemed to make them stop.
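I won't reproduce my exact rules, but on a Linux box the two approaches look roughly like this; the iptables commands and Apache directives below use the placeholder subnets 192.0.2.0/24 and 198.51.100.0/24 rather than their real address ranges.

    # Firewall approach: silently drop everything from the offending subnets
    iptables -A INPUT -s 192.0.2.0/24 -j DROP
    iptables -A INPUT -s 198.51.100.0/24 -j DROP

    # Apache approach (in the relevant <Directory> block or .htaccess):
    # requests from those ranges get a 403 Forbidden instead of dead air
    Order Allow,Deny
    Allow from all
    Deny from 192.0.2.0/24
    Deny from 198.51.100.0/24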
Is that kind of crawling really so bad? Crawling is a natural web activity now, the site is public information, and these guys probably don’t intend any harm, so it doesn’t seem *too* bad. There are worse offenders out there, such as dark crawlers that deliberately go into areas marked ‘Disallow’ in robots.txt, even areas that weren’t part of their original crawl, hoping to find juicy details like e-mail addresses to spam. Still, if they want to crawl they should be open about it and follow the robot standards. If they’re going to be sneaky about getting my data, I’m perfectly within my rights to be sneaky in denying it to them.
Aha!
Morgy + guerrilla warfare + web polling = Black e-Ops? :-)
That’s the problem these days: nobody follows standards or rules anymore…