My web sites don’t get a lot of traffic. One of them, dealing with the birther movement, was getting 40,000 unique visitors a month, but after I stopped publishing new articles, that dropped sharply. What continued was search engine spiders (bots) crawling and indexing the pages. At least 90% of my traffic is from search engines.
One of the bots that tore up one of my sites, accessing the same page over and over a thousand times, was SemRushbot. The amount of web traffic it generated was staggering, and I filed a complaint with them for damages. Because of that abusive bot, I’ve been watching the spider traffic more closely and have identified another bot that spends a lot of time on my website: DotBot.
Neither of the two bots is a search engine. One supposedly monitors ad campaigns for a site’s competitors, and the other has to do with eCommerce. None of my sites has ad campaigns or any kind of eCommerce. Those bots have no business on my sites.
The standard way to stop a bot is to ask it nicely to go away. That’s done with the robots.txt file. The problem with that approach is that a spider can simply ignore the file and crawl your site anyway, or it may take some time for the spider to notice that you’ve changed the file.
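For what it’s worth, asking these two bots to stay away takes only a few lines in the robots.txt file at the site root. The user-agent tokens below are the ones these crawlers are generally reported to look for; check each bot’s own documentation if the exact strings matter:

    User-agent: SemrushBot
    Disallow: /

    User-agent: dotbot
    Disallow: /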
In the case of SemRushbot, it appears that it does respect robots.txt: in the last 24 hours on the site where that bot caused so much trouble, I found that it had accessed the robots.txt file 13 times, sometimes twice in the same minute, but that was the only file it accessed. DotBot is not so cooperative. It accessed robots.txt 30 times, but it ignored the file and requested 233 other pages. It didn’t get them, though. I use the WordFence plugin on all my sites, and one of its advanced blocking features is the ability to ban a user agent. All the DotBot traffic was rejected with an error code. Another bot that spends a lot of time on my sites and provides no value is AhrefsBot, and I block it too.
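WordFence does its blocking through its own settings pages rather than through code you write, but the same effect can be had at the server level. Here is a rough sketch of an equivalent rule for an Apache .htaccess file, assuming a server with mod_rewrite available; this is an illustration, not what WordFence actually does internally:

    <IfModule mod_rewrite.c>
    RewriteEngine On
    # Refuse (403 Forbidden) any request whose user-agent string
    # contains one of these bot names, matched case-insensitively
    RewriteCond %{HTTP_USER_AGENT} (SemrushBot|DotBot|AhrefsBot) [NC]
    RewriteRule .* - [F,L]
    </IfModule>

One advantage of a server-level rule like this is that the request is rejected before WordPress ever loads, so a misbehaving bot costs almost nothing beyond the log entry.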
The most prolific bot on my server right now is BingBot for the Bing search engine. That’s fine because I want people to be able to find my site if they want to. GoogleBot is there about half as much.
On my largest site I have added the location of my sitemaps.xml file to the robots.txt file. The sitemap contains the date each post was last updated, and hopefully the spiders will be smart enough not to re-scan pages that haven’t changed.
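The sitemap reference is a single line in robots.txt, and inside the sitemap each page carries its own last-modified date. The domain and date below are placeholders, not my actual values:

    Sitemap: https://www.example.com/sitemaps.xml

Each entry in the sitemap itself then looks something like this, with the lastmod element giving a spider the hint it needs to skip unchanged pages:

    <url>
      <loc>https://www.example.com/some-post/</loc>
      <lastmod>2018-05-04</lastmod>
    </url>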