
My web site has a database lookup; filling out a CAPTCHA gives you 5 minutes of lookup time. There is also some custom code to detect automated scripts. I do this because I don't want anyone data mining my site.

The problem is that Google does not see the lookup results when it crawls my site. If someone is searching for a string that is present in the result of a lookup, I would like them to find this page by Googling it.

The obvious solution to me is to use the PHP variable $_SERVER['HTTP_USER_AGENT'] to bypass the CAPTCHA and custom security code for the Google bots. My question is whether this is sensible or not.

People could then use Google's cache to view the lookup results without having to fill out the CAPTCHA, but would Google's own script detection methods prevent them from data mining these pages?

Or would there be some way for people to make $_SERVER['HTTP_USER_AGENT'] appear as Google to bypass the security measures?

Thanks in advance.

Josh Lee
edanfalls
  • I'm no expert, but I would have said that if someone wanted to mine your database, they'd figure that out anyway. Why not, instead of a CAPTCHA, just limit the number of requests to something humanly possible? Like 1 every 10 seconds or so. – studioromeo Apr 12 '10 at 11:15
  • I'm pretty sure that this would fall into the "you can't send one thing to Google Bots and another thing to a user" category. This kind of "blocking for normal users but not for Google" process can reduce your visibility. – Narcissus Apr 12 '10 at 11:31
  • I can't imagine any situation where your overall design would be good. If you have some publicly available information on your site, it should be visible on a page with permanent address and this page should be listed somewhere in the site navigation so that any search engine can index it. Any lookup is by definition temporary rearrangement and shouldn't be cached nor indexed. Can you explain why are you using this approach? – calavera.info Apr 12 '10 at 11:36
  • Rob, as I mentioned, in addition to CAPTCHAs (which really aren't very secure by themselves), I use custom code to detect automated scripts. – edanfalls Apr 12 '10 at 12:00
  • calavera.info, perhaps I shouldn't have used the term "lookup". Each "lookup" has its own permanent URL, and is linked to in places on the site and on external web sites, and as such Google "likes" them. I just give the appearance of a lookup on the site, since it is more appropriate to the content. – edanfalls Apr 12 '10 at 12:05
  • Narcissus, could you please elaborate what you mean by reducing visibility? Surely this is increasing visibility to Google? – edanfalls Apr 12 '10 at 12:14
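studioromeo's rate-limiting suggestion could be sketched roughly as follows. This is an illustrative in-memory limiter in Python (the 10-second interval, class name, and dictionary storage are all assumptions, not anything from the site in question); a real deployment would likely use shared storage rather than process memory:

```python
import time

class RateLimiter:
    """Allow at most one request per `interval` seconds per client IP."""

    def __init__(self, interval=10.0):
        self.interval = interval
        self.last_seen = {}  # ip -> timestamp of last allowed request

    def allow(self, ip, now=None):
        # `now` can be injected for testing; defaults to a monotonic clock.
        now = time.monotonic() if now is None else now
        last = self.last_seen.get(ip)
        if last is not None and now - last < self.interval:
            return False  # too soon: reject this lookup
        self.last_seen[ip] = now
        return True

limiter = RateLimiter(interval=10.0)
print(limiter.allow("203.0.113.5", now=0.0))   # first request: allowed
print(limiter.allow("203.0.113.5", now=3.0))   # only 3s later: blocked
print(limiter.allow("203.0.113.5", now=12.0))  # 12s later: allowed again
```

Unlike a CAPTCHA, this slows scripts without inconveniencing human visitors.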

2 Answers


Or would there be some way for people to make $_SERVER['HTTP_USER_AGENT'] appear as Google to bypass the security measures?

Definitely. The user agent is laughably easy to forge; see e.g. the User Agent Switcher extension for Firefox. It's just as easy for a spam bot to set its User-Agent header to Googlebot's.
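To illustrate just how trivial the forgery is, here is a sketch in Python (the URL is a made-up example): any HTTP client can send whatever User-Agent string it likes, and the server sees only that string.

```python
import urllib.request

# Build a request that claims to be Googlebot. From the header alone,
# the server cannot tell this apart from the real crawler.
req = urllib.request.Request(
    "http://example.com/lookup?id=123",  # hypothetical lookup URL
    headers={
        "User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; "
                      "+http://www.google.com/bot.html)"
    },
)
print(req.get_header("User-agent"))
```

This is exactly why a User-Agent check on its own cannot serve as a security measure.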

It might still be worth a shot, though. I'd say just try it out and see what the results are. If you get problems, you may have to think about another way.

An additional way to recognize the Google bot could be the IP range(s) it uses. I don't know whether the bot crawls from fixed IP ranges; you'd have to find out.

Update: it seems to be possible to verify Googlebot by analyzing its IP. From Google Webmaster Central: How to verify Googlebot

Telling webmasters to use DNS to verify on a case-by-case basis seems like the best way to go. I think the recommended technique would be to do a reverse DNS lookup, verify that the name is in the googlebot.com domain, and then do a corresponding forward DNS->IP lookup using that googlebot.com name; eg:

host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1

I don't think just doing a reverse DNS lookup is sufficient, because a spoofer could set up reverse DNS to point to crawl-a-b-c-d.googlebot.com.
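That two-step check (reverse lookup, domain check, then forward confirmation) can be sketched like this. The resolver functions are passed in so the logic can be shown without live DNS; in real use you would pass wrappers around `socket.gethostbyaddr` and `socket.gethostbyname` (the function name and structure here are my own, not from Google's post):

```python
def is_real_googlebot(ip, reverse_lookup, forward_lookup):
    """Forward-confirmed reverse DNS check for Googlebot.

    reverse_lookup(ip)   -> PTR hostname for the IP
    forward_lookup(name) -> IP address that hostname resolves to

    In production: reverse_lookup = lambda ip: socket.gethostbyaddr(ip)[0]
                   forward_lookup = socket.gethostbyname
    """
    try:
        host = reverse_lookup(ip)
    except OSError:
        return False
    # Step 1: the PTR name must be under googlebot.com (or google.com).
    if not (host.endswith(".googlebot.com") or host.endswith(".google.com")):
        return False
    # Step 2: the name must resolve back to the same IP, so a spoofed
    # PTR record merely *pointing* at googlebot.com is not enough.
    try:
        return forward_lookup(host) == ip
    except OSError:
        return False

# Example using the addresses from the `host` output above, with
# dictionaries standing in for the DNS resolvers:
dns_ptr = {"66.249.66.1": "crawl-66-249-66-1.googlebot.com"}
dns_a = {"crawl-66-249-66-1.googlebot.com": "66.249.66.1"}
print(is_real_googlebot("66.249.66.1",
                        dns_ptr.__getitem__, dns_a.__getitem__))
```

Only if both directions agree would you skip the CAPTCHA for the request.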

Pekka

The $_SERVER['HTTP_USER_AGENT'] parameter is not secure; people can fake it if they really want your results. Your decision is a business one: do you want to lower security and potentially allow people/bots to scrape your site, or do you want your results hidden from Google?

chris