
I would like to disallow everything except:

  1. All files in the web root
  2. Specified directories in the web root.

I have seen this example in another answer:

Allow: /public/section1/
Disallow: /

But does the above allow crawling of all files in the web root? That is what I want to allow.

Doochz

1 Answer


If you want to disallow directories without disallowing files, you will need to use wildcards:

User-agent: *
Allow: /public/section1/
Disallow: /*/

The above will allow all of the following:

http://example.com/
http://example.com/somefile
http://example.com/public/section1/
http://example.com/public/section1/somefile
http://example.com/public/section1/somedir/
http://example.com/public/section1/somedir/somefile

And it will disallow all of the following:

http://example.com/somedir/
http://example.com/somedir/somefile
http://example.com/somedir/otherdir/somefile

Just be aware that wildcards are not part of the original robots.txt specification, so not every crawler supports them. All of the major search engines do, but many other crawlers out there do not.
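
If you want to sanity-check rules like these outside of Google's tester, below is a rough Python sketch of Google-style wildcard matching (longest matching pattern wins, and Allow wins a length tie). The RULES list and the to_regex() and is_allowed() helpers are just illustrative names, not part of any robots.txt library:

import re

# Rough sketch of Google-style wildcard matching: '*' matches any run of
# characters, the longest matching pattern wins, and Allow beats Disallow
# when the matching patterns are the same length. Not a full robots.txt parser.
RULES = [
    ("allow", "/public/section1/"),
    ("disallow", "/*/"),
]

def to_regex(pattern):
    # Escape regex metacharacters, then turn the escaped '*' back into '.*'.
    return re.compile(re.escape(pattern).replace(r"\*", ".*"))

def is_allowed(path):
    matches = [(len(pat), kind) for kind, pat in RULES
               if to_regex(pat).match(path)]
    if not matches:
        return True  # no rule matches, so crawling is allowed by default
    # Longest pattern sorts last; on a length tie, "allow" sorts after "disallow".
    matches.sort(key=lambda m: (m[0], m[1] == "allow"))
    return matches[-1][1] == "allow"

for path in ["/", "/somefile", "/public/section1/somefile",
             "/somedir/", "/somedir/somefile"]:
    print(path, "->", "allowed" if is_allowed(path) else "disallowed")

Running it prints "allowed" for the first three paths and "disallowed" for the last two, matching the lists above.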

plasticinsect
  • Thank you. The above seems to do what I want, after checking with Google's robots.txt tester in Webmaster Tools. However, I am unsure of one edge case: when I browse to `http://example.com/somedir`, my web server is configured to add the trailing slash and serve up the index.html (if it exists) automatically. The tester says `http://example.com/somedir` is allowed, but `http://example.com/somedir/` is blocked. Does this mean that the `index.html` within `somedir` will not be seen by the robot? – Doochz Aug 04 '15 at 02:00
  • The crawler will attempt to load `http://example.com/somedir`, and will get back a 301 redirect response pointing to `http://example.com/somedir/`. Any major search engine will then check the new URL against robots.txt before following the redirect, see that it is blocked, and not follow it. I wouldn't be surprised if there were some obscure special-purpose crawlers out there that don't work this way, so YMMV. – plasticinsect Aug 04 '15 at 02:46
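
To make the order of operations in that last comment concrete, here is a tiny hypothetical continuation of the sketch above. It assumes the is_allowed() helper from that sketch is in scope, and it hard-codes the 301 target rather than talking to a real server:

requested = "/somedir"          # allowed, so the crawler requests it
redirect_target = "/somedir/"   # Location header in the 301 response

if is_allowed(requested):
    # A well-behaved crawler re-checks the redirect target against robots.txt
    # before following it, so the index.html behind /somedir/ is never fetched.
    if not is_allowed(redirect_target):
        print("redirect to", redirect_target, "is blocked; not following")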