At one point last weekend, our server was going slow, so I set out to do something about it. As it turned out, there was no single reason for the slowdown; it was pretty much just heavy traffic. However, I couldn't help noticing one user agent, Gigabot/3.0, making a lot of requests. Since this was sort of an emergency and our website is for our users, not some random bot, I blocked it on the spot.
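In case you're wondering how to block a bot like that: one way, assuming an Apache server, is a few lines like these in a .htaccess (or a directory block in the config). mod_setenvif flags the requests and mod_access refuses them; other servers have equivalent mechanisms, of course:

    # flag any request whose User-Agent contains "Gigabot" ...
    SetEnvIfNoCase User-Agent "Gigabot" block_bot
    # ... and answer it with a 403
    Order Allow,Deny
    Allow from all
    Deny from env=block_bot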
Later, when the traffic was down to a normal level again, I thought about unblocking it. That's when I noticed that a) it was still making a lot of requests and didn't seem to take 403 for an answer, and b) it was accessing URLs which are listed in our robots.txt.
Now who is this Gigabot anyway? Apparently, it's the bot for a search engine at www.gigablast.com. Their homepage has this marketing blurb:
"With one of the largest and freshest indexes in the world, Gigablast Inc. has recently joined the elite ranks of major search engine companies."

Despite that, the only time I'd come across them before was when I first looked up Gigabot a few months ago, and I've never heard of anyone actually using that search engine ...
According to the page about Gigabot (which isn't easy to find, btw), their bot does obey the robots.txt. So why did it not do that in our case?
The Gigabot we're seeing appears to be genuine. It's coming from IPs in the 64.62.168.xx range, and gigablast.com itself resolves to an address in that same range. So this must be their spider. The most probable explanation I found was in an old thread on the WebmasterWorld forums: the bot doesn't understand the "shortcut" version of the robots.txt.
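Checking that sort of thing only takes a minute, by the way. Something along these lines shows where the name points and who the crawling range belongs to (dig and whois should be on any Unix-ish machine; the .1 is just an arbitrary address from the range):

    # where does gigablast.com point?
    dig +short gigablast.com
    # and who owns the range the bot crawls from?
    whois 64.62.168.1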
When you want to exclude more than one URL (or more than one bot), you can do it in two ways. You can either list every URL in a record of its own, like this (the paths are just examples):
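    # example paths, each in a record of its own
    User-agent: *
    Disallow: /private/

    User-agent: *
    Disallow: /temp/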
Or you can list them in a block, like so:
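    # the same two paths, combined into one record
    User-agent: *
    Disallow: /private/
    Disallow: /temp/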
Both forms should be equivalent, and every bot I've come across so far (that respected the robots.txt at all) was able to understand the more compact second format.
Not so Gigabot, according to that forum post. On top of that, it's also causing too much traffic, IMHO. It should be going much slower than it currently does. Strange how it's always those b-rank search engines that think they can get away with this sort of bad behaviour ...
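In theory, there's even a knob for that: some crawlers honour the non-standard Crawl-delay directive, which asks for a minimum number of seconds between requests, e.g.:

    User-agent: Gigabot
    Crawl-delay: 10

Whether Gigabot pays any attention to it is another question, given the above.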
After blocking the bot for a few days, its requests dropped to a normal level. So I decided to test the above theory and added this to the top of our robots.txt:
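    User-agent: Gigabot
    Disallow: /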
I.e. denying it access to the whole site (all the rules for the other bots went below that). I then unblocked it. The next time it came around, the first thing it did was get the robots.txt. And that's all it has been requesting since, so it seems to have helped ...