Deciding which robots to block is a very personal matter. Blocking can be applied to many different offline downloaders, programs designed to spider a site and download files on behalf of real people. Some specialize in stealing images; others are general-purpose spiders. Fighting robots is a never-ending battle with no winners, only casualties: you can never stop all abusive behavior from every automated robot and rude program, but you can minimize its effects and reduce the abuse to an acceptable level. Before you can block a bot, you need either the IP address it is coming from or the user-agent string it sends.
A good many of the hits listed in your logs are likely to come from the many programs (commonly known as robots, bots, crawlers, and spiders) that automatically trawl the web for a variety of purposes, including:
• Indexing your site (e.g. Googlebot, Inktomi Slurp)
• Gathering statistics (e.g. WebWatch)
• Site maintenance and validation (e.g. LinkWalker, W3CLinkChecker)
Here are four techniques for blocking unwelcome robots from accessing your site:
1. Robots.txt: Create a text file called robots.txt in the root directory of your site telling robots which parts of the site to stay away from (a sample file appears after this list). Bear in mind that compliance with robots.txt is voluntary, so it only deters well-behaved robots.
2. Blocking IP addresses using Deny (Apache only): If you know the IP address from which the robot is accessing the site, and the site runs on the Apache web server, the Deny directive provides a convenient and effective way to block access (a sample .htaccess fragment appears after this list).
3. Blocking user agents using SetEnvIfNoCase and Deny (Apache only): This method uses the robot's user-agent string to restrict access (a configuration sketch appears after this list).
4. Blocking IP addresses or user agents using PHP: This blocks access from within the web content itself. It is useful when you want to block a bot from very specific content, or to serve alternative content based on the bot's user agent or IP address (a PHP sketch appears after this list).
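For technique 1, a minimal robots.txt might look like the sketch below. The user-agent name BadBot and the /private/ path are placeholders; substitute the robots and directories that matter on your own site.

    # Keep one misbehaving crawler (here called BadBot) out of the whole site
    User-agent: BadBot
    Disallow: /

    # Ask all other robots to stay out of one directory
    User-agent: *
    Disallow: /private/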
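For technique 2, the fragment below is a minimal .htaccess sketch using the Apache 2.2-era Order/Deny directives (available via mod_access_compat on Apache 2.4); the addresses are placeholders from the reserved documentation ranges.

    # Refuse requests from specific addresses or ranges (placeholder addresses)
    Order Allow,Deny
    Allow from all
    Deny from 192.0.2.15
    Deny from 203.0.113.0/24

On Apache 2.4 without mod_access_compat, the equivalent is a Require not ip rule inside a RequireAll block.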
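For technique 3, the sketch below combines SetEnvIfNoCase with Deny. The user-agent substrings are examples of commonly blocked downloaders; replace them with the strings that actually appear in your own logs.

    # Flag requests whose User-Agent header matches an offending string (example names)
    SetEnvIfNoCase User-Agent "EmailSiphon" bad_bot
    SetEnvIfNoCase User-Agent "WebCopier"   bad_bot

    # Refuse any request that was flagged above (Apache 2.2 syntax)
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot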
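For technique 4, here is a minimal PHP sketch; the IP addresses and user-agent fragments are placeholders. Something like this would go at the top of each page you want to protect.

    <?php
    // Placeholder lists: fill in the addresses and user-agent fragments from your logs
    $blocked_ips    = array('192.0.2.15', '203.0.113.7');
    $blocked_agents = array('EmailSiphon', 'WebCopier');

    $ip    = isset($_SERVER['REMOTE_ADDR'])     ? $_SERVER['REMOTE_ADDR']     : '';
    $agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

    // Block on an exact IP match or a case-insensitive user-agent substring match
    $blocked = in_array($ip, $blocked_ips);
    foreach ($blocked_agents as $fragment) {
        if ($agent !== '' && stripos($agent, $fragment) !== false) {
            $blocked = true;
            break;
        }
    }

    if ($blocked) {
        header('HTTP/1.1 403 Forbidden');
        exit('Access denied.');
    }

    // ...the normal page content would follow here...
    ?>

Because the check runs inside the page itself, the same test can be inverted to serve stripped-down or alternative content to a bot instead of refusing the request outright.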