3

I have some geo-targeting code which I want to behave in a particular way if the site is being spidered by a robot, e.g. Google.

Is there any way to infer this?

AJM

4 Answers

5

Presenting different content to search engine crawlers and human visitors - called cloaking - is a risky thing, and can be punished by the search engine if detected.

That said, check out this SO answer with several links to well-maintained "bot lists". You would have to parse the USER_AGENT string and compare it against such a bot list.
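
As a rough sketch of the user-agent comparison described above: the token list here is a short illustrative stand-in (an assumption, not one of the maintained bot lists), and `looks_like_bot` is a hypothetical helper name.

```python
# Illustrative, deliberately short list of crawler tokens. In practice you
# would load one of the maintained bot lists instead of hard-coding this.
BOT_TOKENS = ("googlebot", "bingbot", "slurp", "duckduckbot", "baiduspider")


def looks_like_bot(user_agent):
    """Return True if the User-Agent header contains a known crawler token."""
    if not user_agent:
        # A missing or empty header tells us nothing; treat it as "not a bot".
        return False
    ua = user_agent.lower()
    return any(token in ua for token in BOT_TOKENS)
```

Keep in mind this only tells you what the client claims to be; the header is trivially spoofed.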

Pekka
2

You can do it by checking the user-agent or the IP. The latter may be preferable, as it's not unknown for other, less reputable bots to spoof the user-agent of the big guys. The IPs of Google et al. tend to fall in narrow ranges, so detecting on IP shouldn't require compiling vast lists.
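
A minimal sketch of the IP check, using Python's standard `ipaddress` module. The networks below are reserved documentation ranges used purely as placeholders (an assumption); you would substitute the ranges the search engines actually crawl from. `ip_in_crawler_range` is a hypothetical helper name.

```python
import ipaddress

# Placeholder networks (reserved documentation ranges), NOT real crawler
# ranges. Replace them with the ranges you trust.
CRAWLER_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24")
]


def ip_in_crawler_range(ip, networks=CRAWLER_NETWORKS):
    """Return True if the requesting IP falls inside one of the networks."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        # Not a parseable IP address at all.
        return False
    return any(addr in net for net in networks)
```

For example, `ip_in_crawler_range("192.0.2.17")` matches the first placeholder network, while an address outside both ranges does not.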

Richard H
1

You can check this via the user-agent property. For more info on user agent strings, check here: http://www.user-agents.org/ and look for the records marked with type "R = Robot, crawler, spider". But this is not guaranteed: the user-agent can be changed or spoofed, so it is not 100% reliable.

anthares
1

If you are only interested in the well set up, reputable bots (e.g. Google, Yahoo, MSN/Live/Bing/whatever-it-is-today, Ask, etc.), then you can use round-trip DNS checking.

1) Check for a known user agent (look for a known substring such as googlebot)
e.g. Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

2) Do a reverse DNS lookup for the requesting IP and check that it comes from a reasonable domain.
e.g. the reverse DNS of 66.249.71.202 is crawl-66-249-71-202.googlebot.com (so we're happy that it comes from googlebot.com)

3) On its own, step 2 can be faked, so now look up the A record for the hostname returned in step 2 and make sure it resolves to the original requesting IP.
e.g. the DNS record for the above is
crawl-66-249-71-202.googlebot.com. A 66.249.71.202

66.249.71.202 was the requesting IP address so this is a valid googlebot.
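
Steps 2 and 3 above could be sketched roughly like this with Python's standard `socket` module. The function names and the `reverse`/`forward` parameters are hypothetical (they only exist so the lookups can be swapped out), the trusted suffix is the googlebot.com domain from the example, and this version handles IPv4 only. You would run the user-agent check from step 1 first and only pay for the DNS lookups when it matches.

```python
import socket

# Domain from the example above; add the domains of other bots you trust.
TRUSTED_SUFFIXES = (".googlebot.com",)


def _reverse_lookup(ip):
    # Step 2: PTR lookup, returns the primary hostname for the IP.
    return socket.gethostbyaddr(ip)[0]


def _forward_lookup(hostname):
    # Step 3: A-record lookup, returns the list of IPv4 addresses.
    return socket.gethostbyname_ex(hostname)[2]


def is_verified_crawler(ip, suffixes=TRUSTED_SUFFIXES,
                        reverse=_reverse_lookup, forward=_forward_lookup):
    """Round-trip DNS check: IP -> hostname -> IPs, which must include ip."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:
        # No PTR record, or the lookup failed.
        return False
    if not hostname.endswith(tuple(suffixes)):
        return False
    try:
        return ip in forward(hostname)
    except OSError:
        return False
```

DNS lookups are slow compared to a string match, so it is probably worth caching the result per IP.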

status203