Detecting a crawl (Search Engine's visit) using PHP

Question

When a search engine visits a webpage, what do the get_browser() function and $_SERVER['HTTP_USER_AGENT'] return?

Also, what other evidence does PHP offer that a search engine is crawling a webpage?

Why does it matter? If you serve them different content you're violating their TOS and risk being banned. — John Conde, Jun 01 '12 at 16:29
http://stackoverflow.com/questions/677419/how-to-detect-search-engine-bots-with-php — David, Jun 01 '12 at 16:33
Before the question gets closed as an exact duplicate, can anyone tell me what `get_browser()` returns? — Tabrez Ahmed, Jun 01 '12 at 16:36
@TabrezAhmed - At the end of the day it is electrical impulses into your machine. I am/have/will fake being an IE browser along with the others. You will not be able to tell the difference. Web crawlers can be courteous. — Ed Heal, Jun 01 '12 at 16:46

score 1 · Accepted Answer · answered Jun 01 '12 at 16:45

The get_browser() function attempts to determine the browser's capabilities (returned as an object, or as an array if you ask for one), but don't count on it too much: it depends on an up-to-date browscap.ini, and user agents are not standardized. For a serious app, build your own detection.
$_SERVER["HTTP_USER_AGENT"] is a long string "describing" the user's browser; it can optionally be passed as the first parameter of the function above. A tip: inspect this string yourself to identify the user's browser instead of relying on get_browser(). Also be prepared for the user agent to be missing. An example of this string:
Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en) AppleWebKit/418 (KHTML, like Gecko) Safari/417.9.3
A search engine (robot, spider, crawler) that follows the rules will visit your pages according to the directives in robots.txt, which lives in your site's root. Without a robots.txt, a spider can crawl the whole site, as long as it finds links inside your pages; with the file, you can tell the spider which parts it may crawl. NOTE: this applies only to "good" spiders, not the bad ones.
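
To make the point about a missing user agent concrete, here is a minimal sketch. The helper name current_user_agent() is made up for this example, and the get_browser() call is guarded because it only works when the browscap setting in php.ini points to a browscap.ini file (assumed here to be possibly absent).

```php
<?php
// Hypothetical helper (not part of PHP): returns the user-agent string,
// or null when the header is missing, not a string, or blank.
function current_user_agent(array $server): ?string
{
    $ua = $server['HTTP_USER_AGENT'] ?? null;
    if (!is_string($ua)) {
        return null;
    }
    $ua = trim($ua);
    return $ua === '' ? null : $ua;
}

$ua = current_user_agent($_SERVER);

if ($ua === null) {
    // Some clients (and some bots) send no User-Agent header at all.
    echo "No user agent sent\n";
} else {
    echo "User agent: ", $ua, "\n";

    // get_browser() needs the browscap ini directive; skip it when unset.
    if (ini_get('browscap')) {
        // Second argument true asks for an array instead of an object.
        $info = get_browser($ua, true);
        if (is_array($info)) {
            echo "Browser: ", $info['browser'] ?? 'unknown', "\n";
        }
    }
}
```

Passing $_SERVER in as a parameter keeps the helper easy to try with hand-made arrays, for example current_user_agent(['HTTP_USER_AGENT' => 'Googlebot-Image/1.0']).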

`robots.txt` is just a hint, along with sitemaps. — Ed Heal, Jun 01 '12 at 17:08

Dark · Answer 2 · 2012-06-04T13:50:34.417

$_SERVER['HTTP_USER_AGENT'] will give you the crawler's user agent (get_browser() looks up capabilities from that same string). For the main search engines it should look like this:

Google :

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)
Googlebot-Image/1.0

Bing :

Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534+ (KHTML, like Gecko) BingPreview/1.0b
msnbot/2.0b (+http://search.msn.com/msnbot.htm)
msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)

Yahoo :

Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
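
A rough sketch of how the strings above could be matched in PHP. The function name detect_crawler() and the token list are assumptions made for this example: the tokens are simply taken from the user agents listed here, so the list is illustrative and not exhaustive. Keep in mind that a match only tells you what the client claims to be, since a user agent can be faked.

```php
<?php
// Tokens copied from the user-agent strings listed above (illustrative only).
const CRAWLER_TOKENS = [
    'Googlebot'    => 'Google', // also matches Googlebot-Mobile, Googlebot-Image
    'bingbot'      => 'Bing',
    'BingPreview'  => 'Bing',
    'msnbot'       => 'Bing',   // also matches msnbot-media
    'Yahoo! Slurp' => 'Yahoo',
];

// Hypothetical helper: returns the search engine name the user agent
// claims to belong to, or null when nothing matches or no agent was sent.
function detect_crawler(?string $userAgent): ?string
{
    if ($userAgent === null || $userAgent === '') {
        return null;
    }
    foreach (CRAWLER_TOKENS as $token => $engine) {
        // Case-insensitive substring search.
        if (stripos($userAgent, $token) !== false) {
            return $engine;
        }
    }
    return null;
}

$engine = detect_crawler($_SERVER['HTTP_USER_AGENT'] ?? null);
echo $engine === null ? "Not a known crawler\n" : "Claims to be: $engine\n";
```

A plain substring check is used instead of one big regular expression so that adding or removing a crawler is a one-line change.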

-> To fully control (and limit) the crawl, don't rely on robots.txt; use .htaccess or httpd.conf rules. (Even good crawlers don't give a f*** about your Disallow rules in robots.txt half of the time.)
