I'm trying to block all bots/crawlers/spiders for a special directory. How can I do that with htaccess? I searched a little bit and found a solution by blocking based on the user agent:

RewriteCond %{HTTP_USER_AGENT} googlebot

Now I would need to add more user agents (ideally all known bots), and the rule should apply only to my separate directory. I already have a robots.txt, but not all crawlers take a look at it ... Blocking by IP address is not an option. Or are there other solutions? I know about password protection, but I would have to ask first whether that is an option. Either way, I am looking for a solution based on the user agent.

testing

3 Answers


You need to have mod_rewrite enabled. Place it in a .htaccess file in that folder. If placed elsewhere (e.g. a parent folder), the RewriteRule pattern needs to be slightly modified to include that folder name.

RewriteEngine On

RewriteCond %{HTTP_USER_AGENT} (googlebot|bingbot|Baiduspider) [NC]
RewriteRule .* - [R=403,L]
  1. I have entered only a few bots; you can add any others yourself (letter case does not matter).
  2. This rule will respond with a "403 Access Forbidden" result code for such requests. You can change it to another HTTP response code if you really want (403 is the most appropriate here, considering your requirements).
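As a fuller sketch of the same idea (the extra bot names and comments are mine, purely for illustration; build your own list from the user agents you actually see in your logs):

```apache
# .htaccess inside the directory you want to protect
RewriteEngine On

# Case-insensitive ([NC]) substring match against the User-Agent header;
# extend the alternation with any other bots you want to block
RewriteCond %{HTTP_USER_AGENT} (googlebot|bingbot|Baiduspider|YandexBot|AhrefsBot) [NC]

# "-" means no substitution; [F] sends 403 Forbidden (equivalent to R=403 here),
# and [L] stops processing further rules
RewriteRule .* - [F,L]
```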
LazyOne
    Where do I get a list of bots? Can I use `RewriteRule /var/www/html/myweb/.* - [R=403,L]`? – testing May 24 '12 at 14:53
  • 1) For example: check the user-agent field in your server logs, analyze it, and extract a unique part to identify each bot (it should not be a problem after you see a few examples). Maybe such a list already exists somewhere, but I never bothered with it. 2) No, you cannot use a physical path there (the path part of the actual URL is expected; please consult the manual if necessary: http://httpd.apache.org/docs/current/mod/mod_rewrite.html#rewriterule ) – LazyOne May 24 '12 at 16:12
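The case-insensitive substring match that `[NC]` performs can be sketched outside Apache, e.g. in Python (the bot list here is illustrative, not exhaustive):

```python
import re

# Illustrative bot list -- extend it from your own access logs
BOT_PATTERN = re.compile(r"(googlebot|bingbot|baiduspider)", re.IGNORECASE)

def is_blocked(user_agent: str) -> bool:
    """Mimics RewriteCond %{HTTP_USER_AGENT} (...) [NC]:
    case-insensitive substring match against the User-Agent header."""
    return BOT_PATTERN.search(user_agent) is not None

print(is_blocked("Mozilla/5.0 (compatible; Googlebot/2.1)"))          # True
print(is_blocked("Mozilla/5.0 (Windows NT 10.0) Firefox/125.0"))      # False
```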

Why use .htaccess or mod_rewrite for a job that is specifically meant for robots.txt? Here is the robots.txt snippet you will need to block a specific set of directories for search crawlers:

User-agent: *
Disallow: /subdir1/
Disallow: /subdir2/
Disallow: /subdir3/

This will block all search bots in directories /subdir1/, /subdir2/ and /subdir3/.

For more explanation see here: http://www.robotstxt.org/orig.html
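How a well-behaved crawler interprets those rules can be checked with Python's standard `urllib.robotparser` (the `SomeBot` name and `example.com` URLs are illustrative; the snippet is parsed directly instead of being fetched):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Parse the robots.txt snippet above directly
rp.parse("""User-agent: *
Disallow: /subdir1/
Disallow: /subdir2/
Disallow: /subdir3/""".splitlines())

print(rp.can_fetch("SomeBot", "http://example.com/subdir1/page.html"))  # False
print(rp.can_fetch("SomeBot", "http://example.com/other/page.html"))    # True
```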

anubhava
    Check original question: "...I have already a robots.txt but not all crawlers take a look at it ..." – LazyOne May 25 '12 at 12:29
  • @LazyOne: I would be keen to know which crawlers do ignore robots.txt. – anubhava May 25 '12 at 12:53
  • Check your web server logs -- you will find them. Of course, big names (like Google, Bing etc.) will not do that, but **some** smaller (or fake) ones quite often request pages that are prohibited in robots.txt (for example a customer account area, where the user must be logged in / the content is specific to that user only). If the OP wants to deal with them, then why not -- it's his time. – LazyOne May 25 '12 at 13:14
  • @LazyOne Sorry I don't have any server as I am not an admin. You said `some smaller (or fake ones)` violate this protocol of robots.txt. My question is how are you going to get all of their names in RewriteCond to block them? Even if you are able to get some of them today in RewriteCond, what is the guarantee that new fakes will not come in future. Will you keep adding them in RewriteCond forever? – anubhava May 25 '12 at 13:19
  • Me? I will be doing nothing! It's the OP who wants it -- ask him what he will do (if he's happy to keep an eye on such a list and update it on every occasion). – LazyOne May 25 '12 at 13:42
  • @LazyOne: You or the OP, it doesn't matter. Whoever has to live with such an unmaintainable solution will very soon realize they must either keep updating a monster list like that (and eventually hit the max length of a RewriteCond line) or work with the widely accepted standard solution for these problems, i.e. use `robots.txt` and not be bothered about fake crawlers. – anubhava May 25 '12 at 14:11
  • It will be a problem for the person who uses it. It's his/her choice. "I will believe it when I see it" -- I'm sure you've heard that. Until the OP tries this (and maintains the list for some time), he will not realise all the possible complications he *may* face. – LazyOne May 25 '12 at 14:29
  • Of course, if a bot is ignoring robots.txt, it might be forging its HTTP_USER_AGENT. This only stops bots that don't lie about their identity but still willfully ignore robots.txt. Another reason to want this, as opposed to robots.txt: if you're not the webmaster, you may not be able to change robots.txt, but you can place a .htaccess in your own directory. – not-just-yeti May 04 '14 at 19:02
  • Both `.htaccess` and `robots.txt` go inside an individual site's `DocumentRoot`. If someone can place a `.htaccess`, then he/she can very well place a `robots.txt` as well under the same `DocumentRoot`. – anubhava May 05 '14 at 07:12
    robots.txt link went bad – daslicious May 09 '17 at 22:03

I know the topic is "old", but still, for people who land here (as I also did): you could look at the great 5G Blacklist 2013 (08/2023 update: 7G Firewall / 8G Firewall beta).
It's a great help, and not only for WordPress but for all other sites too. Works awesome, IMHO.
Another one worth looking at is Linux Reviews' anti-spam through .htaccess (last functional archived link).

quantme
Charles