
I have prepared an .htaccess file and placed it in a directory with PDF files to prevent hotlinking except from my own site, as follows:

RewriteEngine On
RewriteCond %{HTTP_REFERER} !^http(s)?://(www\.)?example\.com [NC]
RewriteRule ([^/]+)\.(pdf)$ http://www.example.com/search_gcse/?q=$1 [NC,R,L]

This rule works as expected. If the link comes from an external site, the request is redirected to my search page, where the platform searches for that (and similar) file.

So, when I search in Google, the results shown by Google (files that have already been indexed) are redirected to my search page, which is fine. Now I'm concerned about the next time Google indexes my site, so I added a new condition, as follows:

RewriteEngine On
RewriteCond %{HTTP_REFERER} !^http(s)?://(www\.)?example\.com [NC]
RewriteCond %{HTTP_USER_AGENT} !(googlebot) [NC]
RewriteRule ([^/]+)\.(pdf)$ http://www.example.com/search_gcse/?q=$1 [NC,R,L]
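For reference, here is a commented version of that second block (an untested sketch; the `bingbot` and `slurp` alternatives are my additions, in case you want to exempt other major crawlers as well):

```apache
RewriteEngine On
# Skip the redirect when the request comes from our own pages
RewriteCond %{HTTP_REFERER} !^http(s)?://(www\.)?example\.com [NC]
# Also skip it for known crawlers, so they can still fetch the PDFs
RewriteCond %{HTTP_USER_AGENT} !(googlebot|bingbot|slurp) [NC]
# Everything else asking for a .pdf gets sent to the search page
RewriteRule ([^/]+)\.pdf$ http://www.example.com/search_gcse/?q=$1 [NC,R,L]
```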

However, I'm not sure whether that rule is working, or how to check it. If I try to access a file from Google search results, I'm still redirected to my search page, so the rule doesn't affect the existing search results.
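One way to check rules like these without a browser is curl, which lets you fake both the Referer and the User-Agent. The domain and the PDF path below are placeholders for the real site:

```shell
# No Referer, ordinary client: should be redirected (301/302)
curl -s -o /dev/null -w "%{http_code} %{redirect_url}\n" \
  "http://www.example.com/docs/report.pdf"

# Referer from our own site: should serve the PDF (200)
curl -s -o /dev/null -w "%{http_code}\n" \
  -e "http://www.example.com/page.html" \
  "http://www.example.com/docs/report.pdf"

# Googlebot User-Agent, no Referer: should serve the PDF (200)
# if the bot exemption is working
curl -s -o /dev/null -w "%{http_code}\n" \
  -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" \
  "http://www.example.com/docs/report.pdf"
```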

Will this rule allow Google to index my new PDF files, but prevent direct access from the Google search results page? If not, what is the correct way to achieve this?

  • Your first code block will allow Google bots to index the files, but users will be redirected to search page when trying to access the PDF from Google results. – hjpotter92 Sep 22 '15 at 10:01
  • @hjpotter92 Oh, I think I got it... google bot indexing does not access as a reference but as a different condition ? May you specify it a bit more as an answer? I'll accept and upvote it. – pQB Sep 22 '15 at 10:12

1 Answer


While your .htaccess rules will prevent hotlinking, they will not work well with search indexers and other robots: the search engines will still be able to index your files.

To disallow search engines from indexing your files, you'd need to send the X-Robots-Tag header. Google provides brief documentation on how to prevent robots from indexing/caching/archiving a page it has crawled.

<Files ~ "\.pdf$">
  Header set X-Robots-Tag "noindex, nofollow"
</Files>
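To confirm the header is actually being sent (this requires mod_headers to be enabled), request one of the PDFs and inspect the response headers; the URL here is a placeholder:

```shell
# -I sends a HEAD request and prints only the response headers
curl -sI "http://www.example.com/docs/report.pdf" | grep -i x-robots-tag
# Expect a line like: X-Robots-Tag: noindex, nofollow
```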
hjpotter92