144

How can one detect search engine bots using PHP?

Christian Gollhardt
terrific

19 Answers

268

I use the following code which seems to be working fine:

function _bot_detected() {
  // Known crawler tokens in the User-Agent header (case-insensitive).
  return (
    isset($_SERVER['HTTP_USER_AGENT'])
    && preg_match('/bot|crawl|slurp|spider|mediapartners/i', $_SERVER['HTTP_USER_AGENT'])
  );
}

Update 16-06-2017: added mediapartners (see https://support.google.com/webmasters/answer/1061943?hl=en).

snex
minnur
  • Does this assume that bots reveal themselves as such? – Jeromie Devera Jul 13 '14 at 02:43
  • Vote down; the user agent can be changed in Chrome settings, Firefox, etc. – barwnikk Feb 21 '15 at 22:20
  • Yes, the user agent can be changed, but if someone is changing it to contain "bot", "crawl", "slurp", or "spider", they know what's coming to them. It also depends on utility. I wouldn't use this to strip out all CSS, but I would use it to not store cookies, ignore location logging, or skip a landing page. – JonShipman Mar 26 '15 at 21:28
  • Doesn't anyone agree with me that this is far too wide a range to match? – Daan Jun 10 '15 at 07:40
  • I have used your function for more than a day now and it seems to be working. But I am not sure. How can I send testing bots to check whether it works? – FarrisFahad Jun 09 '16 at 00:06
  • The regex in this answer is nice for being simple and wide-spanning. For my purpose I want to be quick, and I don't care if there are a few false positives or false negatives. – Gregory Apr 06 '17 at 11:23
  • Good solution; I would just add 'Google Page Speed Insights' to the regex: '/bot|crawl|slurp|spider|mediapartners|Google Page Speed Insights/i' – nikksan Aug 07 '17 at 07:56
  • This is only half of verifying, if you want to do it right. The other half is to use DNS to verify the IP. See the answer below: https://stackoverflow.com/a/29457983/64911 – mlissner Nov 08 '17 at 18:41
  • This is a good answer, but note this from the PHP documentation for preg_match: "Do not use preg_match() if you only want to check if one string is contained in another string. Use strpos() or strstr() instead as they will be faster." – Frodik Mar 08 '20 at 09:48
  • if (preg_match('/http|bot|bingbot|googlebot|robot|spider|slurp|crawler|curl|^$/i', $userAgent)) – MrPHP Aug 25 '20 at 22:30
  • If you are checking for "bot", then you don't need to check for bingbot, googlebot, and robot. – Nick Jun 07 '22 at 16:35
  • Just a small question: what prevents someone from doing a POST request and setting headers to look like Googlebot, thus fooling this check? – Kosem Jan 06 '23 at 09:25
  • `inspection` might be added to cover the URL inspection tool available in Webmaster tools. Ex: `Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.5845.110 Mobile Safari/537.36 (compatible; Google-InspectionTool/1.0;)` – Semra Sep 01 '23 at 14:59
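
Following Frodik's comment about preg_match() overhead, here is a sketch of a non-regex variant using the same token list; this is an editorial illustration, not part of the original answer:

function _bot_detected_strpos() {
    if (!isset($_SERVER['HTTP_USER_AGENT'])) {
        return false;
    }
    // stripos() is case-insensitive, so "Googlebot" matches the "bot" token.
    foreach (array('bot', 'crawl', 'slurp', 'spider', 'mediapartners') as $token) {
        if (stripos($_SERVER['HTTP_USER_AGENT'], $token) !== false) {
            return true;
        }
    }
    return false;
}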
89

Here's a Search Engine Directory of Spider names

Then you check $_SERVER['HTTP_USER_AGENT'] to see whether the agent is one of those spiders:

if(strstr(strtolower($_SERVER['HTTP_USER_AGENT']), "googlebot"))
{
    // what to do
}
Ólafur Waage
  • if ((eregi("yahoo",$this->USER_AGENT)) && (eregi("slurp",$this->USER_AGENT))) { $this->Browser = "Yahoo! Slurp"; $this->Type = "robot"; } will this work fine? – terrific Mar 24 '09 at 13:43
  • Why strstr and not strpos? – rinchik Mar 19 '13 at 22:04
  • Because strpos can return 0 (the position), while strstr returns FALSE on failure; you can use strpos if you add a !== false check at the end. – Ólafur Waage Mar 19 '13 at 22:34
  • Erm, `strpos` returns `FALSE` on failure, too. It's faster and more efficient, though (no preprocessing, and no O(m) storage). – Damon Apr 14 '14 at 10:19
  • What about fake user agents?! –  Sep 11 '14 at 23:14
  • I can change the user agent in Chrome. – barwnikk Feb 21 '15 at 22:19
  • I think strpos is better. I do it like this: `(strpos(strtolower($_SERVER['HTTP_USER_AGENT']), 'google') === false)`. I don't use "googlebot" because I also want to detect Google Insights tests. – The Onin Apr 09 '15 at 20:24
  • And what if someone could change their user agent to a fake name like "Googlebot"? I think checking the IP range is more trustworthy! – Mojtaba Rezaeian Jul 01 '15 at 06:39
  • The answer is good but I wouldn't rely on the resource that's being linked to. 'Yahoo' is not even in the list. – Robert Sinclair Jul 10 '16 at 18:24
  • Please do not use this method to identify a Google bot! Even on a small-scale site we have 403 agent-IP combinations with "googlebot" in them, while only 126 are real Google bots (as of access logs from Feb 2021)! Please use طراحی سایت تهران's answer below and see the linked document about verifying a *real* Google bot! – boppy Mar 03 '21 at 17:05
  • stristr() is the case-insensitive version of strstr(). – Sergio Abreu Jun 06 '21 at 18:40
21

Check the $_SERVER['HTTP_USER_AGENT'] for some of the strings listed here:

http://www.useragentstring.com/pages/useragentstring.php

Or more specifically for crawlers:

http://www.useragentstring.com/pages/useragentstring.php?typ=Crawler

If you want to, say, log the number of visits of the most common search engine crawlers, you could use:

$interestingCrawlers = array( 'google', 'yahoo' );
$pattern = '/(' . implode('|', $interestingCrawlers) . ')/i'; // the 'i' modifier belongs in the pattern, not in a preg_match() argument
$matches = array();
$numMatches = preg_match($pattern, $_SERVER['HTTP_USER_AGENT'], $matches);
if ($numMatches > 0) // Found a match
{
  // $matches[1] contains the name of the first crawler found ('google' or 'yahoo')
}
ayyyee
Jukka Dahlbom
17

You can check whether it's a search engine with this function:

<?php
function crawlerDetect($USER_AGENT)
{
    $crawlers = array(
        'Google'          => 'Google',
        'MSN'             => 'msnbot',
        'Rambler'         => 'Rambler',
        'Yahoo'           => 'Yahoo',
        'AbachoBOT'       => 'AbachoBOT',
        'accoona'         => 'Accoona',
        'AcoiRobot'       => 'AcoiRobot',
        'ASPSeek'         => 'ASPSeek',
        'CrocCrawler'     => 'CrocCrawler',
        'Dumbot'          => 'Dumbot',
        'FAST-WebCrawler' => 'FAST-WebCrawler',
        'GeonaBot'        => 'GeonaBot',
        'Gigabot'         => 'Gigabot',
        'Lycos spider'    => 'Lycos',
        'MSRBOT'          => 'MSRBOT',
        'Altavista robot' => 'Scooter',
        'AltaVista robot' => 'Altavista',
        'ID-Search Bot'   => 'IDBot',
        'eStyle Bot'      => 'eStyle',
        'Scrubby robot'   => 'Scrubby',
        'Facebook'        => 'facebookexternalhit',
    );

    // It is better to build this string once and cache it than to
    // call implode() on every request.
    $pattern = '/' . implode('|', $crawlers) . '/i';

    // Note: the original checked strpos($crawlers_agents, $USER_AGENT),
    // which only matches when the whole user agent equals one token;
    // the crawler token must be searched for inside the user agent instead.
    return preg_match($pattern, $USER_AGENT) === 1;
}
?>

Then you can use it like this:

<?php
$USER_AGENT = $_SERVER['HTTP_USER_AGENT'];
if (crawlerDetect($USER_AGENT)) return "no need for lang redirection";
?>
Darren
macherif
  • I think this list is outdated; I don't see "slurp" for example, which is Yahoo's spider: https://help.yahoo.com/kb/SLN22600.html – Daan Jun 10 '15 at 07:41
17

I'm using this to detect bots:

if (preg_match('/bot|crawl|curl|dataprovider|search|get|spider|find|java|majesticsEO|google|yahoo|teoma|contaxe|yandex|libwww-perl|facebookexternalhit/i', $_SERVER['HTTP_USER_AGENT'])) {
    // is bot
}

In addition I use a whitelist of allowed bots, so I can tell the unwanted ones apart (see the combined check below):

if (preg_match('/apple|baidu|bingbot|facebookexternalhit|googlebot|-google|ia_archiver|msnbot|naverbot|pingdom|seznambot|slurp|teoma|twitter|yandex|yeti/i', $_SERVER['HTTP_USER_AGENT'])) {
    // allowed bot
}

An unwanted bot (= a potential false-positive user) is then able to solve a captcha to unblock itself for 24 hours. And since no one solves this captcha, I know it does not produce false positives. So the bot detection seems to work perfectly.

Note: My whitelist is based on Facebook's robots.txt.
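
Putting the two regexes together, the unwanted-bot case described above could be expressed like this (a sketch assembled from the patterns in this answer, not the original code):

$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

// A client is an unwanted bot when it matches the bot pattern
// but not the whitelist of allowed bots.
$is_bot     = preg_match('/bot|crawl|curl|dataprovider|search|get|spider|find|java|majesticsEO|google|yahoo|teoma|contaxe|yandex|libwww-perl|facebookexternalhit/i', $ua);
$is_allowed = preg_match('/apple|baidu|bingbot|facebookexternalhit|googlebot|-google|ia_archiver|msnbot|naverbot|pingdom|seznambot|slurp|teoma|twitter|yandex|yeti/i', $ua);

if ($is_bot && !$is_allowed) {
    // unwanted bot: show the captcha described above
}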

mgutt
15

Because any client can set the user agent to whatever they want, looking for 'Googlebot', 'bingbot', etc. is only half the job.

The second part is verifying the client's IP. In the old days this required maintaining IP lists, and all the lists you find online are outdated. The top search engines officially support verification through DNS, as explained by Google (https://support.google.com/webmasters/answer/80553) and Bing (http://www.bing.com/webmaster/help/how-to-verify-bingbot-3905dc26).

First, perform a reverse DNS lookup of the client IP. For Google this yields a host name under googlebot.com; for Bing it's under search.msn.com. Then, because someone could set such a reverse DNS record on their own IP, you need to verify it with a forward DNS lookup on that hostname. If the resulting IP is the same as the site visitor's, you can be sure it's a crawler from that search engine.

I've written a library in Java that performs these checks for you. Feel free to port it to PHP. It's on GitHub: https://github.com/optimaize/webcrawler-verifier
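
For those asking about PHP in the comments: a minimal sketch of the same two-step verification, assuming the official crawler domains are passed in (this is a rough editorial illustration, not a port of the library):

function isVerifiedCrawler($ip, array $crawlerDomains) {
    // Step 1: reverse DNS lookup. On failure gethostbyaddr()
    // returns the unmodified IP (or false on malformed input).
    $hostname = gethostbyaddr($ip);
    if ($hostname === false || $hostname === $ip) {
        return false;
    }

    // Step 2: the hostname must end in one of the official domains,
    // e.g. googlebot.com for Google or search.msn.com for Bing.
    $isOfficial = false;
    foreach ($crawlerDomains as $domain) {
        if (preg_match('/(^|\.)' . preg_quote($domain, '/') . '$/i', $hostname)) {
            $isOfficial = true;
            break;
        }
    }
    if (!$isOfficial) {
        return false;
    }

    // Step 3: the forward lookup must point back to the original IP.
    return gethostbyname($hostname) === $ip;
}

// Usage: isVerifiedCrawler($_SERVER['REMOTE_ADDR'], array('googlebot.com', 'search.msn.com'));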

Eugene Fidelin
Fabian Kessler
  • All the other answers using user-agent strings are only halfway there. Wow. – mlissner Nov 08 '17 at 18:39
  • There are many comments about user-agent checking only being half the check. This is true, but keep in mind there's a huge performance impact to doing the full DNS and reverse DNS lookup. It all depends on the level of certainty you need for your use case. This is for 100% certainty at the expense of performance. You have to decide what the right balance (and therefore best solution) is for your situation. – Brady Emerson Jan 16 '19 at 21:43
  • There's no "huge performance impact". First, the reverse DNS lookup is only performed on visitors that identify as a search engine; humans are not affected at all. Second, this lookup is only performed once per IP and the result is cached; search engines keep using the same IP ranges for a very long time and usually hit one site from one or a few IPs only. Also, you could perform the validation delayed: let the first request through, then background-validate, and if negative, prevent successive requests. (I would advise against this because harvesters have large IP pools now...) – Fabian Kessler Feb 15 '19 at 07:02
  • Is there some similar library written in PHP? – userlond Jun 25 '19 at 04:11
13

If you really need to detect Google's bots, you should never rely on the user agent or the IP address alone, because the user agent can be changed. According to what Google says in Verifying Googlebot:

To verify Googlebot as the caller:

1. Run a reverse DNS lookup on the accessing IP address from your logs, using the host command.

2. Verify that the domain name is in either googlebot.com or google.com.

3. Run a forward DNS lookup on the domain name retrieved in step 1, using the host command on the retrieved domain name. Verify that it is the same as the original accessing IP address from your logs.

Here is my tested code:

<?php
$remote_add = $_SERVER['REMOTE_ADDR'];
$hostname   = gethostbyaddr($remote_add);
$googlebot  = 'googlebot.com';
$google     = 'google.com';

// The reversed-string comparison checks that the hostname ENDS with
// googlebot.com or google.com, i.e. the exact domain, not a subdomain
// of something else.
if (stripos(strrev($hostname), strrev($googlebot)) === 0
    or stripos(strrev($hostname), strrev($google)) === 0)
{
    // Step 3 from the quoted instructions: the forward lookup
    // must resolve back to the original accessing IP.
    if (gethostbyname($hostname) === $remote_add) {
        //add your code
    }
}
?>

In this code we check the hostname, which should end with googlebot.com or google.com; it is really important to check the exact domain, not just any subdomain. I hope you enjoy ;)

  • This is the only right answer, when you **absolutely** need to be sure the request is from Google or Googlebot. See the Google documentation [Verifying Googlebot](https://developers.google.com/search/docs/advanced/crawling/verifying-googlebot). – Sjoerd Linders Jun 01 '21 at 13:08
  • For those people trying to verify the Google bot by UA, you guys are fooling yourselves (and your partners). Like Sjoerd said, verifying the host is the ONLY correct solution. – Randy Lam Aug 31 '21 at 03:00
9

I use this function; part of the regex comes from PrestaShop, but I added some more bots to it.

public function isBot()
{
    $bot_regex = '/BotLink|bingbot|AhrefsBot|ahoy|AlkalineBOT|anthill|appie|arale|araneo|AraybOt|ariadne|arks|ATN_Worldwide|Atomz|bbot|Bjaaland|Ukonline|borg\-bot\/0\.9|boxseabot|bspider|calif|christcrawler|CMC\/0\.01|combine|confuzzledbot|CoolBot|cosmos|Internet Cruiser Robot|cusco|cyberspyder|cydralspider|desertrealm, desert realm|digger|DIIbot|grabber|downloadexpress|DragonBot|dwcp|ecollector|ebiness|elfinbot|esculapio|esther|fastcrawler|FDSE|FELIX IDE|ESI|fido|Hämähäkki|KIT\-Fireball|fouineur|Freecrawl|gammaSpider|gazz|gcreep|golem|googlebot|griffon|Gromit|gulliver|gulper|hambot|havIndex|hotwired|htdig|iajabot|INGRID\/0\.1|Informant|InfoSpiders|inspectorwww|irobot|Iron33|JBot|jcrawler|Teoma|Jeeves|jobo|image\.kapsi\.net|KDD\-Explorer|ko_yappo_robot|label\-grabber|larbin|legs|Linkidator|linkwalker|Lockon|logo_gif_crawler|marvin|mattie|mediafox|MerzScope|NEC\-MeshExplorer|MindCrawler|udmsearch|moget|Motor|msnbot|muncher|muninn|MuscatFerret|MwdSearch|sharp\-info\-agent|WebMechanic|NetScoop|newscan\-online|ObjectsSearch|Occam|Orbsearch\/1\.0|packrat|pageboy|ParaSite|patric|pegasus|perlcrawler|phpdig|piltdownman|Pimptrain|pjspider|PlumtreeWebAccessor|PortalBSpider|psbot|Getterrobo\-Plus|Raven|RHCS|RixBot|roadrunner|Robbie|robi|RoboCrawl|robofox|Scooter|Search\-AU|searchprocess|Senrigan|Shagseeker|sift|SimBot|Site Valet|skymob|SLCrawler\/2\.0|slurp|ESI|snooper|solbot|speedy|spider_monkey|SpiderBot\/1\.0|spiderline|nil|suke|http:\/\/www\.sygol\.com|tach_bw|TechBOT|templeton|titin|topiclink|UdmSearch|urlck|Valkyrie libwww\-perl|verticrawl|Victoria|void\-bot|Voyager|VWbot_K|crawlpaper|wapspider|WebBandit\/1\.0|webcatcher|T\-H\-U\-N\-D\-E\-R\-S\-T\-O\-N\-E|WebMoose|webquest|webreaper|webs|webspider|WebWalker|wget|winona|whowhere|wlm|WOLP|WWWC|none|XGET|Nederland\.zoek|AISearchBot|woriobot|NetSeer|Nutch|YandexBot|YandexMobileBot|SemrushBot|FatBot|MJ12bot|DotBot|AddThis|baiduspider|SeznamBot|mod_pagespeed|CCBot|openstat.ru\/Bot|m2e/i';
    // An empty user agent is treated as a bot as well.
    $userAgent = empty($_SERVER['HTTP_USER_AGENT']) ? FALSE : $_SERVER['HTTP_USER_AGENT'];
    $isBot = !$userAgent || preg_match($bot_regex, $userAgent);

    return $isBot;
}

Anyway, take care: some bots use a browser-like user agent to fake their identity (I get many Russian IPs showing this behaviour on my site).

One distinctive feature of most bots is that they don't carry any cookies, so no session is attached to them; a sketch of how to exploit this follows below. (I am not sure exactly how best to do it, but it is for sure a good way to track them.)
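
A sketch of that cookie idea (an editorial illustration of the heuristic, not the answer's code): set a cookie on the first response and treat clients that never send it back as suspect.

session_start();

if (!isset($_COOKIE['seen'])) {
    // First visit, or a client that does not carry cookies (most bots).
    setcookie('seen', '1', time() + 86400);
    $_SESSION['maybe_bot'] = true;
} else {
    // The cookie came back: a normal browser (or a bot that keeps cookies).
    $_SESSION['maybe_bot'] = false;
}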

WonderLand
8

Use the Device Detector open-source library; it offers an isBot() function: https://github.com/piwik/device-detector
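
Typical usage, following the project's README (installed via Composer as piwik/device-detector):

require_once 'vendor/autoload.php';

use DeviceDetector\DeviceDetector;

$dd = new DeviceDetector($_SERVER['HTTP_USER_AGENT']);
$dd->parse();

if ($dd->isBot()) {
    // getBot() returns details such as name, category and producer.
    $botInfo = $dd->getBot();
}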

mattab
7

You could analyse the user agent ($_SERVER['HTTP_USER_AGENT']) or compare the client’s IP address ($_SERVER['REMOTE_ADDR']) with a list of IP addresses of search engine bots.
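
A sketch of the IP-list approach (the CIDR ranges below are placeholders for illustration; real, current ranges have to come from the search engines themselves):

// Hypothetical example ranges; IPv4 only.
$botRanges = array('66.249.64.0/19', '157.55.0.0/16');

function ipInCidr($ip, $cidr) {
    list($subnet, $bits) = explode('/', $cidr);
    $mask = -1 << (32 - (int) $bits);
    return (ip2long($ip) & $mask) === (ip2long($subnet) & $mask);
}

$isKnownBotIp = false;
foreach ($botRanges as $range) {
    if (ipInCidr($_SERVER['REMOTE_ADDR'], $range)) {
        $isKnownBotIp = true;
        break;
    }
}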

Gumbo
  • An IP list is more secure if you want to make sure the user agent really is a search engine bot, because it is possible to create fake user agents by name. – Mojtaba Rezaeian Jul 01 '15 at 06:45
6

I made a good and fast function for this:

function is_bot() {
    if (isset($_SERVER['HTTP_USER_AGENT'])) {
        return preg_match('/rambler|abacho|acoi|accona|aspseek|altavista|estyle|scrubby|lycos|geona|ia_archiver|alexa|sogou|skype|facebook|twitter|pinterest|linkedin|naver|bing|google|yahoo|duckduckgo|yandex|baidu|teoma|xing|java\/1.7.0_45|bot|crawl|slurp|spider|mediapartners|\sask\s|\saol\s/i', $_SERVER['HTTP_USER_AGENT']);
    }

    return false;
}

This covers 99% of all common bots, search engines, etc.

Ivijan Stefan Stipić
4

A bot detector that has been working successfully on my website:

function isBotDetected() {

    if ( preg_match('/abacho|accona|AddThis|AdsBot|ahoy|AhrefsBot|AISearchBot|alexa|altavista|anthill|appie|applebot|arale|araneo|AraybOt|ariadne|arks|aspseek|ATN_Worldwide|Atomz|baiduspider|baidu|bbot|bingbot|bing|Bjaaland|BlackWidow|BotLink|bot|boxseabot|bspider|calif|CCBot|ChinaClaw|christcrawler|CMC\/0\.01|combine|confuzzledbot|contaxe|CoolBot|cosmos|crawler|crawlpaper|crawl|curl|cusco|cyberspyder|cydralspider|dataprovider|digger|DIIbot|DotBot|downloadexpress|DragonBot|DuckDuckBot|dwcp|EasouSpider|ebiness|ecollector|elfinbot|esculapio|ESI|esther|eStyle|Ezooms|facebookexternalhit|facebook|facebot|fastcrawler|FatBot|FDSE|FELIX IDE|fetch|fido|find|Firefly|fouineur|Freecrawl|froogle|gammaSpider|gazz|gcreep|geona|Getterrobo-Plus|get|girafabot|golem|googlebot|\-google|grabber|GrabNet|griffon|Gromit|gulliver|gulper|hambot|havIndex|hotwired|htdig|HTTrack|ia_archiver|iajabot|IDBot|Informant|InfoSeek|InfoSpiders|INGRID\/0\.1|inktomi|inspectorwww|Internet Cruiser Robot|irobot|Iron33|JBot|jcrawler|Jeeves|jobo|KDD\-Explorer|KIT\-Fireball|ko_yappo_robot|label\-grabber|larbin|legs|libwww-perl|linkedin|Linkidator|linkwalker|Lockon|logo_gif_crawler|Lycos|m2e|majesticsEO|marvin|mattie|mediafox|mediapartners|MerzScope|MindCrawler|MJ12bot|mod_pagespeed|moget|Motor|msnbot|muncher|muninn|MuscatFerret|MwdSearch|NationalDirectory|naverbot|NEC\-MeshExplorer|NetcraftSurveyAgent|NetScoop|NetSeer|newscan\-online|nil|none|Nutch|ObjectsSearch|Occam|openstat.ru\/Bot|packrat|pageboy|ParaSite|patric|pegasus|perlcrawler|phpdig|piltdownman|Pimptrain|pingdom|pinterest|pjspider|PlumtreeWebAccessor|PortalBSpider|psbot|rambler|Raven|RHCS|RixBot|roadrunner|Robbie|robi|RoboCrawl|robofox|Scooter|Scrubby|Search\-AU|searchprocess|search|SemrushBot|Senrigan|seznambot|Shagseeker|sharp\-info\-agent|sift|SimBot|Site Valet|SiteSucker|skymob|SLCrawler\/2\.0|slurp|snooper|solbot|speedy|spider_monkey|SpiderBot\/1\.0|spiderline|spider|suke|tach_bw|TechBOT|TechnoratiSnoop|templeton|teoma|titin|topiclink|twitterbot|twitter|UdmSearch|Ukonline|UnwindFetchor|URL_Spider_SQL|urlck|urlresolver|Valkyrie libwww\-perl|verticrawl|Victoria|void\-bot|Voyager|VWbot_K|wapspider|WebBandit\/1\.0|webcatcher|WebCopier|WebFindBot|WebLeacher|WebMechanic|WebMoose|webquest|webreaper|webspider|webs|WebWalker|WebZip|wget|whowhere|winona|wlm|WOLP|woriobot|WWWC|XGET|xing|yahoo|YandexBot|YandexMobileBot|yandex|yeti|Zeus/i', $_SERVER['HTTP_USER_AGENT'])
    ) {
        return true; // one of the bots listed above was detected
    }

    return false;

} // End :: isBotDetected()
Irshad Khan
4
<?php // IPCLOAK HOOK
if (CLOAKING_LEVEL != 4) {
    $lastupdated = date("Ymd", filemtime(FILE_BOTS));
    if ($lastupdated != date("Ymd")) {
        $lists = array(
        'http://labs.getyacg.com/spiders/google.txt',
        'http://labs.getyacg.com/spiders/inktomi.txt',
        'http://labs.getyacg.com/spiders/lycos.txt',
        'http://labs.getyacg.com/spiders/msn.txt',
        'http://labs.getyacg.com/spiders/altavista.txt',
        'http://labs.getyacg.com/spiders/askjeeves.txt',
        'http://labs.getyacg.com/spiders/wisenut.txt',
        );
        $opt = '';
        foreach ($lists as $list) {
            $opt .= fetch($list); // fetch() is YACG's helper for downloading the lists
        }
        $opt = preg_replace("/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/", "\n", $opt);
        $fp =  fopen(FILE_BOTS,"w");
        fwrite($fp,$opt);
        fclose($fp);
    }
    $ip = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '';
    $ref = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
    $agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    $host = strtolower(gethostbyaddr($ip));
    $file = implode(" ", file(FILE_BOTS));
    $exp = explode(".", $ip);
    $class = $exp[0].'.'.$exp[1].'.'.$exp[2].'.';
    $threshold = CLOAKING_LEVEL;
    $cloak = 0;
    // A host can match at most one of these, so the checks must be
    // OR'ed, not AND'ed as in the original script.
    if (stristr($host, "googlebot") || stristr($host, "inktomi") || stristr($host, "msn")) {
        $cloak++;
    }
    if (stristr($file, $class)) {
        $cloak++;
    }
    if (stristr($file, $agent)) {
        $cloak++;
    }
    if (strlen($ref) > 0) {
        $cloak = 0;
    }

    if ($cloak >= $threshold) {
        $cloakdirective = 1;
    } else {
        $cloakdirective = 0;
    }
}
?>

That would be the ideal way to cloak for spiders. It's from an open source script called YACG: http://getyacg.com

Needs a bit of work, but definitely the way to go.

L. Cosio
2

For Google I'm using this method:

function is_google() {
    $ip   = $_SERVER['REMOTE_ADDR'];
    $host = gethostbyaddr( $ip );

    // The hostname must END in .google.com or .googlebot.com; a plain
    // strpos() check would also accept e.g. "x.google.com.evil.com".
    if ( preg_match( '/\.google(bot)?\.com$/i', $host ) ) {

        // Forward-confirm the hostname to guard against spoofed reverse DNS.
        $forward_lookup = gethostbyname( $host );

        if ( $forward_lookup == $ip ) {
            return true;
        }

        return false;
    } else {
        return false;
    }
}

var_dump( is_google() );

Credits: https://support.google.com/webmasters/answer/80553
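
Since gethostbyaddr() can be slow, the per-IP result can be cached, as suggested in the comments on the DNS-verification answer above; a sketch assuming the APCu extension is available:

function is_google_cached() {
    $key = 'is_google_' . $_SERVER['REMOTE_ADDR'];

    $cached = apcu_fetch($key, $found);
    if ($found) {
        return $cached;
    }

    $result = is_google();            // the function above
    apcu_store($key, $result, 86400); // remember this IP for a day
    return $result;
}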

Mike Aron
1

I'm using this code, and it works pretty well. It makes it very easy to see which user agents have visited your site. The code opens a file and appends the user agent to it. You can check this file each day by going to yourdomain.com/useragent.txt, learn about new user agents, and add them to the condition of your if clause.

$user_agent = strtolower($_SERVER['HTTP_USER_AGENT']);
if (!preg_match("/Googlebot|MJ12bot|yandexbot/i", $user_agent)) {
    // if the conditions are not met, do what you need

    // append the user agent to useragent.txt; check that file each day,
    // learn about new user agents and add them to the condition above
    if ($user_agent != "") {
        $myfile = fopen("useragent.txt", "a") or die("Unable to open file useragent.txt!");
        fwrite($myfile, $user_agent . "\n");
        fclose($myfile);
    }
}

This is the content of useragent.txt

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; MJ12bot/v1.4.6; http://mj12bot.com/)
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (iphone; cpu iphone os 9_3 like mac os x) applewebkit/601.1.46 (khtml, like gecko) version/9.0 mobile/13e198 safari/601.1
mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/53.0.2785.143 safari/537.36
mozilla/5.0 (compatible; linkdexbot/2.2; +http://www.linkdex.com/bots/)
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/53.0.2785.143 safari/537.36
mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/53.0.2785.143 safari/537.36
mozilla/5.0 (compatible; baiduspider/2.0; +http://www.baidu.com/search/spider.html)
zoombot (linkbot 1.0 http://suite.seozoom.it/bot.html)
mozilla/5.0 (windows nt 10.0; wow64) applewebkit/537.36 (khtml, like gecko) chrome/44.0.2403.155 safari/537.36 opr/31.0.1889.174
mozilla/5.0 (windows nt 10.0; wow64) applewebkit/537.36 (khtml, like gecko) chrome/44.0.2403.155 safari/537.36 opr/31.0.1889.174
sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)
mozilla/5.0 (windows nt 10.0; wow64) applewebkit/537.36 (khtml, like gecko) chrome/44.0.2403.155 safari/537.36 opr/31.0.1889.174
  • What would your if-clause string piece be for this one? mozilla/5.0 (iphone; cpu iphone os 9_3 like mac os x) applewebkit/601.1.46 (khtml, like gecko) version/9.0 mobile/13e198 safari/601.1 – Average Joe Jan 24 '19 at 15:17
0

Verifying Googlebot

As the user agent can be changed...

The only officially supported way to identify a Google bot is to run a reverse DNS lookup on the accessing IP address, run a forward DNS lookup on the result to verify that it points back to the accessing IP address, and confirm that the resulting domain name is in either the googlebot.com or google.com domain.

Taken from here. So you must run both DNS lookups, reverse and forward. See this guide on Google Search Central.

yanntinoco
0

Here is what I use:

function is_bot() {
    if (preg_match('/bot|crawl|spider|mediapartners|slurp|patrol/i', $_SERVER['HTTP_USER_AGENT'])) {
        return true;
    }
    if (strpos($_SERVER['HTTP_USER_AGENT'], 'Headless') !== false) {
        return true;
    }
    return false;
}
tomcaptain
-1
function bot_detected() {

  if (preg_match('/bot|crawl|slurp|spider|mediapartners/i', $_SERVER['HTTP_USER_AGENT'])) {
    return true;
  }
  else {
    return false;
  }
}
Elyor
-1

Might be late, but what about a hidden link? All well-behaved bots will follow a link with the rel="follow" attribute; only bad bots will also follow one marked rel="nofollow".

<a style="display:none;" rel="follow" href="javascript:void(0);" onclick="isabot();">.</a>

<script>
function isabot(){
    // define a variable to pass to PHP with Ajax,
    // or send the bot info directly wherever you need it.
    var isabot = true;
}
</script>

For a bad bot you can use this:

<a style="display:none;" href="javascript:void(0);" rel="nofollow" onclick="isBadbot();">.</a>

For a PHP-specific approach you can remove the onclick attribute and point the href attribute at your IP detector / bot detector, like so:

<a style="display:none;" rel="follow" href="https://somedomain.com/botdetector.php">.</a>

OR

<a style="display:none;" rel="nofollow" href="https://somedomain.com/badbotdetector.php">.</a>

You can work with this and maybe use both: one link detects a bot, while the other proves it to be a bad bot. A sketch of such a detector endpoint follows.
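
The botdetector.php endpoint is left open above; a minimal sketch of what it might do (hypothetical, not part of the original answer) is to log whoever follows the hidden link:

<?php
// botdetector.php (hypothetical): anything requesting this hidden
// link gets its IP and user agent logged for later review.
$line = sprintf(
    "%s\t%s\t%s\n",
    date('c'),
    $_SERVER['REMOTE_ADDR'],
    isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '-'
);
file_put_contents(__DIR__ . '/bot-hits.log', $line, FILE_APPEND | LOCK_EX);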

Hope you find this useful.