
I have a secret folder on my website and I don't want search engines to know about it. I didn't add the folder name to a Disallow rule in robots.txt, because listing the name there would announce the secret folder to anyone who reads the file.

My question is: will search engines be able to discover or crawl this folder even if I never publish any links to it?

zoora

2 Answers

2

The only truly reliable way to hide a directory from everyone is to put it behind a password. If you absolutely cannot put it behind a password, one band-aid solution is to name the folder something like:

http://example.com/secret-aic7bsufbi2jbqnduq2g7yf/

and then block just the first part of the name, like this:

Disallow: /secret-

This effectively blocks the directory without revealing its full name: any crawler that obeys robots.txt will stay out, and a hostile crawler that reads robots.txt still doesn't learn the full path. Just don't mistake this for actual security. It will keep the major search engines out, but there are no guarantees beyond that; the only truly reliable way to keep everyone out of a secret directory is to put it behind a password.
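
For completeness, a Disallow rule only takes effect inside a User-agent group, so a minimal robots.txt using this trick would look something like this (the /secret- prefix is from the example above):

User-agent: *
Disallow: /secret-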

plasticinsect
-1

Yes, they can crawl it.

Your folder is not "secret" at all. Do a quick search for a wget or curl command line that downloads a whole site, then run it against your own site to convince yourself that this approach is not secure.

Here is a good example: download all folders, subfolders and files using wget
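
As a rough illustration (example.org is a placeholder), a recursive wget will pull down everything it can reach by following links:

wget --wait=9 --recursive --level=2 http://example.org/

Here --recursive follows links, --level=2 limits the crawl depth, and --wait=9 throttles the requests.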

You can use .htaccess to prevent user agents from requesting a directory listing, and this will probably protect you fairly well as long as you don't give your folder an obvious name like "site", but I'd test it. See deny direct access to a folder and file by htaccess.
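
As a sketch of both ideas (assuming Apache with AllowOverride enabled for this directory; the .htpasswd path is a placeholder), an .htaccess file in the folder could disable auto-generated listings and require a login:

# disable auto-generated directory listings
Options -Indexes

# require a password (the first answer's recommendation)
AuthType Basic
AuthName "Restricted"
AuthUserFile /full/path/to/.htpasswd
Require valid-user

The .htpasswd file itself would be created with Apache's htpasswd utility.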

John3136
  • So technically there's nothing we can do to hide it? All we can do is protect it, right? – zoora Jul 18 '17 at 03:40
  • Yes, that is how you should approach it. – John3136 Jul 18 '17 at 03:44
  • Ok thanks. Anyway, will they be able to index all the files inside this folder too? – zoora Jul 18 '17 at 03:47
  • Search for "curl crawl site" and try the command, e.g. `wget --wait=9 --recursive --level=2 http://example.org/` (only goes 2 levels deep) – John3136 Jul 18 '17 at 03:53
  • If the URL really is not linked anywhere, tools like wget wouldn’t find it. You (search engine bots, tools, etc.) would have to *guess* the URL. – unor Jul 18 '17 at 13:44
  • @unor umm really? – zoora Jul 18 '17 at 13:47
  • @zoora: Is that surprising? If you create something random (like `/283.444-44ttuzZZ792_-347nj_dfaASh2/`) and make sure not to publish it *anywhere* (which includes: not visiting it with certain browser add-ons that log visited pages; not sending a Referer; etc.), how should anyone be able to find it (without trying every possible string)? – unor Jul 18 '17 at 13:51
  • @unor: If you can read the directory you can read all of its children, linked or not. There is more to the web than `index.html`. See https://superuser.com/questions/655554/download-all-folders-subfolders-and-files-using-wget. I've used it to prove to an organization that their supposedly secure site was wide open to anyone who cared to look. – John3136 Jul 18 '17 at 23:08
  • @John3136: I don’t see how that question is relevant. It uses wget, which follows links. If something is not linked/embedded, wget won’t find it. It works for the OP because they are using index files that list the whole content of a directory. – unor Jul 19 '17 at 09:32