144

How can one detect search engine bots using PHP?

Christian Gollhardt
terrific

19 Answers

268

I use the following code which seems to be working fine:

function _bot_detected() {
  // Known crawler tokens in the User-Agent header (case-insensitive).
  return (
    isset($_SERVER['HTTP_USER_AGENT'])
    && preg_match('/bot|crawl|slurp|spider|mediapartners/i', $_SERVER['HTTP_USER_AGENT'])
  );
}

Update 16-06-2017: added mediapartners (see https://support.google.com/webmasters/answer/1061943?hl=en).

snex
minnur
  • Does this assume that bots reveal themselves as such? – Jeromie Devera Jul 13 '14 at 02:43
  • Vote down; the user agent can be changed in Chrome settings, Firefox, etc. – barwnikk Feb 21 '15 at 22:20
  • Yes, the user agent can be changed, but if someone is changing it to contain "bot", "crawl", "slurp", or "spider", they know what's coming to them. It also depends on utility. I wouldn't use this to strip out all CSS, but I would use it to not store cookies, ignore location logging, or skip a landing page. – JonShipman Mar 26 '15 at 21:28
  • Doesn't anyone agree with me that this is far too wide a range to match? – Daan Jun 10 '15 at 07:40
  • I have used your function for more than a day now and it seems to be working. But I am not sure. How can I send testing bots to check whether it works? – FarrisFahad Jun 09 '16 at 00:06
  • The regex in this answer is nice for being simple and wide-spanning. For my purpose I want to be quick, and I don't care if there are a few false positives or false negatives. – Gregory Apr 06 '17 at 11:23
  • Good solution; I would just add 'Google Page Speed Insights' to the regex: '/bot|crawl|slurp|spider|mediapartners|Google Page Speed Insights/i' – nikksan Aug 07 '17 at 07:56
  • This is only half of verifying, if you want to do it right. The other half is to use DNS to verify the IP. See the answer below: https://stackoverflow.com/a/29457983/64911 – mlissner Nov 08 '17 at 18:41
  • This is a good answer, but note this from the PHP documentation for preg_match: "Do not use preg_match() if you only want to check if one string is contained in another string. Use strpos() or strstr() instead as they will be faster." – Frodik Mar 08 '20 at 09:48
  • if (preg_match('/http|bot|bingbot|googlebot|robot|spider|slurp|crawler|curl|^$/i', $userAgent)) – MrPHP Aug 25 '20 at 22:30
  • If you are checking for "bot", then you don't need to check for bingbot, googlebot, and robot. – Nick Jun 07 '22 at 16:35
  • Just a small question: what prevents someone from doing a POST request and setting headers to look like Googlebot, thus fooling this check? – Kosem Jan 06 '23 at 09:25
  • `inspection` might be added to cover the URL inspection tool available in Webmaster tools. Ex: `Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.5845.110 Mobile Safari/537.36 (compatible; Google-InspectionTool/1.0;)` – Semra Sep 01 '23 at 14:59
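
Following Frodik's comment about preg_match() overhead, here is a sketch of a non-regex variant using the same token list; this is an editorial illustration, not part of the original answer:

function _bot_detected_strpos() {
    if (!isset($_SERVER['HTTP_USER_AGENT'])) {
        return false;
    }
    // stripos() is case-insensitive, so "Googlebot" matches the "bot" token.
    foreach (array('bot', 'crawl', 'slurp', 'spider', 'mediapartners') as $token) {
        if (stripos($_SERVER['HTTP_USER_AGENT'], $token) !== false) {
            return true;
        }
    }
    return false;
}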
89

Here's a Search Engine Directory of Spider names

Then you check $_SERVER['HTTP_USER_AGENT'] to see whether the agent is one of those spiders:

if(strstr(strtolower($_SERVER['HTTP_USER_AGENT']), "googlebot"))
{
    // what to do
}
Ólafur Waage
  • if ((eregi("yahoo",$this->USER_AGENT)) && (eregi("slurp",$this->USER_AGENT))) { $this->Browser = "Yahoo! Slurp"; $this->Type = "robot"; } will this work fine? – terrific Mar 24 '09 at 13:43
  • Why strstr and not strpos? – rinchik Mar 19 '13 at 22:04
  • Because strpos can return 0 (the position), while strstr returns FALSE on failure; you can use strpos if you add a !== false check at the end. – Ólafur Waage Mar 19 '13 at 22:34
  • Erm, `strpos` returns `FALSE` on failure, too. It's faster and more efficient, though (no preprocessing, and no O(m) storage). – Damon Apr 14 '14 at 10:19
  • What about fake user agents?! –  Sep 11 '14 at 23:14
  • I can change the user agent in Chrome. – barwnikk Feb 21 '15 at 22:19
  • I think strpos is better. I do it like this: `(strpos(strtolower($_SERVER['HTTP_USER_AGENT']), 'google') === false)`. I don't use "googlebot" because I also want to detect Google Insights tests. – The Onin Apr 09 '15 at 20:24
  • And what if someone could change their user agent to a fake name like "Googlebot"? I think checking the IP range is more trustworthy! – Mojtaba Rezaeian Jul 01 '15 at 06:39
  • The answer is good but I wouldn't rely on the resource that's being linked to. 'Yahoo' is not even in the list. – Robert Sinclair Jul 10 '16 at 18:24
  • Please do not use this method to identify a Google bot! Even on a small-scale site we have 403 agent-IP combinations with "googlebot" in them, while only 126 are real Google bots (as of access logs from Feb 2021)! Please use طراحی سایت تهران's answer below and see the linked document about verifying a *real* Google bot! – boppy Mar 03 '21 at 17:05
  • stristr() is the case-insensitive version of strstr(). – Sergio Abreu Jun 06 '21 at 18:40
21

Check the $_SERVER['HTTP_USER_AGENT'] for some of the strings listed here:

http://www.useragentstring.com/pages/useragentstring.php

Or more specifically for crawlers:

http://www.useragentstring.com/pages/useragentstring.php?typ=Crawler

If you want to, say, log the number of visits of the most common search engine crawlers, you could use:

$interestingCrawlers = array( 'google', 'yahoo' );
$pattern = '/(' . implode('|', $interestingCrawlers) . ')/i'; // the 'i' modifier belongs in the pattern, not in a preg_match() argument
$matches = array();
$numMatches = preg_match($pattern, $_SERVER['HTTP_USER_AGENT'], $matches);
if ($numMatches > 0) // Found a match
{
  // $matches[1] contains the name of the first crawler found ('google' or 'yahoo')
}
ayyyee
Jukka Dahlbom
17

You can check whether it's a search engine with this function:

<?php
function crawlerDetect($USER_AGENT)
{
    $crawlers = array(
        'Google'          => 'Google',
        'MSN'             => 'msnbot',
        'Rambler'         => 'Rambler',
        'Yahoo'           => 'Yahoo',
        'AbachoBOT'       => 'AbachoBOT',
        'accoona'         => 'Accoona',
        'AcoiRobot'       => 'AcoiRobot',
        'ASPSeek'         => 'ASPSeek',
        'CrocCrawler'     => 'CrocCrawler',
        'Dumbot'          => 'Dumbot',
        'FAST-WebCrawler' => 'FAST-WebCrawler',
        'GeonaBot'        => 'GeonaBot',
        'Gigabot'         => 'Gigabot',
        'Lycos spider'    => 'Lycos',
        'MSRBOT'          => 'MSRBOT',
        'Altavista robot' => 'Scooter',
        'AltaVista robot' => 'Altavista',
        'ID-Search Bot'   => 'IDBot',
        'eStyle Bot'      => 'eStyle',
        'Scrubby robot'   => 'Scrubby',
        'Facebook'        => 'facebookexternalhit',
    );

    // It is better to build this string once and cache it than to
    // call implode() on every request.
    $pattern = '/' . implode('|', $crawlers) . '/i';

    // Note: the original checked strpos($crawlers_agents, $USER_AGENT),
    // which only matches when the whole user agent equals one token;
    // the crawler token must be searched for inside the user agent instead.
    return preg_match($pattern, $USER_AGENT) === 1;
}
?>

Then you can use it like this:

<?php
$USER_AGENT = $_SERVER['HTTP_USER_AGENT'];
if (crawlerDetect($USER_AGENT)) return "no need for lang redirection";
?>
Darren
macherif
  • I think this list is outdated; I don't see "slurp" for example, which is Yahoo's spider: https://help.yahoo.com/kb/SLN22600.html – Daan Jun 10 '15 at 07:41
17

I'm using this to detect bots:

if (preg_match('/bot|crawl|curl|dataprovider|search|get|spider|find|java|majesticsEO|google|yahoo|teoma|contaxe|yandex|libwww-perl|facebookexternalhit/i', $_SERVER['HTTP_USER_AGENT'])) {
    // is bot
}

In addition I use a whitelist of allowed bots, so I can tell the unwanted ones apart (see the combined check below):

if (preg_match('/apple|baidu|bingbot|facebookexternalhit|googlebot|-google|ia_archiver|msnbot|naverbot|pingdom|seznambot|slurp|teoma|twitter|yandex|yeti/i', $_SERVER['HTTP_USER_AGENT'])) {
    // allowed bot
}

An unwanted bot (= a potential false-positive user) is then able to solve a captcha to unblock itself for 24 hours. And since no one solves this captcha, I know it does not produce false positives. So the bot detection seems to work perfectly.

Note: My whitelist is based on Facebook's robots.txt.
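
Putting the two regexes together, the unwanted-bot case described above could be expressed like this (a sketch assembled from the patterns in this answer, not the original code):

$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

// A client is an unwanted bot when it matches the bot pattern
// but not the whitelist of allowed bots.
$is_bot     = preg_match('/bot|crawl|curl|dataprovider|search|get|spider|find|java|majesticsEO|google|yahoo|teoma|contaxe|yandex|libwww-perl|facebookexternalhit/i', $ua);
$is_allowed = preg_match('/apple|baidu|bingbot|facebookexternalhit|googlebot|-google|ia_archiver|msnbot|naverbot|pingdom|seznambot|slurp|teoma|twitter|yandex|yeti/i', $ua);

if ($is_bot && !$is_allowed) {
    // unwanted bot: show the captcha described above
}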

mgutt
15

Because any client can set the user agent to whatever they want, looking for 'Googlebot', 'bingbot', etc. is only half the job.

The second part is verifying the client's IP. In the old days this required maintaining IP lists, and all the lists you find online are outdated. The top search engines officially support verification through DNS, as explained by Google (https://support.google.com/webmasters/answer/80553) and Bing (http://www.bing.com/webmaster/help/how-to-verify-bingbot-3905dc26).

First, perform a reverse DNS lookup of the client IP. For Google this yields a host name under googlebot.com; for Bing it's under search.msn.com. Then, because someone could set such a reverse DNS record on their own IP, you need to verify it with a forward DNS lookup on that hostname. If the resulting IP is the same as the site visitor's, you can be sure it's a crawler from that search engine.

I've written a library in Java that performs these checks for you. Feel free to port it to PHP. It's on GitHub: https://github.com/optimaize/webcrawler-verifier
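
For those asking about PHP in the comments: a minimal sketch of the same two-step verification, assuming the official crawler domains are passed in (this is a rough editorial illustration, not a port of the library):

function isVerifiedCrawler($ip, array $crawlerDomains) {
    // Step 1: reverse DNS lookup. On failure gethostbyaddr()
    // returns the unmodified IP (or false on malformed input).
    $hostname = gethostbyaddr($ip);
    if ($hostname === false || $hostname === $ip) {
        return false;
    }

    // Step 2: the hostname must end in one of the official domains,
    // e.g. googlebot.com for Google or search.msn.com for Bing.
    $isOfficial = false;
    foreach ($crawlerDomains as $domain) {
        if (preg_match('/(^|\.)' . preg_quote($domain, '/') . '$/i', $hostname)) {
            $isOfficial = true;
            break;
        }
    }
    if (!$isOfficial) {
        return false;
    }

    // Step 3: the forward lookup must point back to the original IP.
    return gethostbyname($hostname) === $ip;
}

// Usage: isVerifiedCrawler($_SERVER['REMOTE_ADDR'], array('googlebot.com', 'search.msn.com'));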

Eugene Fidelin
Fabian Kessler
  • All the other answers using user-agent strings are only halfway there. Wow. – mlissner Nov 08 '17 at 18:39
  • There are many comments about user-agent checking only being half the check. This is true, but keep in mind there's a huge performance impact to doing the full DNS and reverse DNS lookup. It all depends on the level of certainty you need for your use case. This is for 100% certainty at the expense of performance. You have to decide what the right balance (and therefore best solution) is for your situation. – Brady Emerson Jan 16 '19 at 21:43
  • There's no "huge performance impact". First, the reverse DNS lookup is only performed on visitors that identify as a search engine; humans are not affected at all. Second, this lookup is only performed once per IP and the result is cached; search engines keep using the same IP ranges for a very long time and usually hit one site from one or a few IPs only. Also, you could perform the validation delayed: let the first request through, then background-validate, and if negative, prevent successive requests. (I would advise against this because harvesters have large IP pools now...) – Fabian Kessler Feb 15 '19 at 07:02
  • Is there some similar library written in PHP? – userlond Jun 25 '19 at 04:11
13

If you really need to detect Google's bots, you should never rely on the user agent or the IP address alone, because the user agent can be changed. According to what Google says in Verifying Googlebot:

To verify Googlebot as the caller:

1. Run a reverse DNS lookup on the accessing IP address from your logs, using the host command.

2. Verify that the domain name is in either googlebot.com or google.com.

3. Run a forward DNS lookup on the domain name retrieved in step 1, using the host command on the retrieved domain name. Verify that it is the same as the original accessing IP address from your logs.

Here is my tested code:

<?php
$remote_add = $_SERVER['REMOTE_ADDR'];
$hostname   = gethostbyaddr($remote_add);
$googlebot  = 'googlebot.com';
$google     = 'google.com';

// The reversed-string comparison checks that the hostname ENDS with
// googlebot.com or google.com, i.e. the exact domain, not a subdomain
// of something else.
if (stripos(strrev($hostname), strrev($googlebot)) === 0
    or stripos(strrev($hostname), strrev($google)) === 0)
{
    // Step 3 from the quoted instructions: the forward lookup
    // must resolve back to the original accessing IP.
    if (gethostbyname($hostname) === $remote_add) {
        //add your code
    }
}
?>

In this code we check the hostname, which should end with googlebot.com or google.com; it is really important to check the exact domain, not just any subdomain. I hope you enjoy ;)

  • This is the only right answer, when you **absolutely** need to be sure the request is from Google or Googlebot. See the Google documentation [Verifying Googlebot](https://developers.google.com/search/docs/advanced/crawling/verifying-googlebot). – Sjoerd Linders Jun 01 '21 at 13:08
  • For those people trying to verify the Google bot by UA, you guys are fooling yourselves (and your partners). Like Sjoerd said, verifying the host is the ONLY correct solution. – Randy Lam Aug 31 '21 at 03:00
9

I use this function; part of the regex comes from PrestaShop, but I added some more bots to it.

public function isBot()
{
    $bot_regex = '/BotLink|bingbot|AhrefsBot|ahoy|AlkalineBOT|anthill|appie|arale|araneo|AraybOt|ariadne|arks|ATN_Worldwide|Atomz|bbot|Bjaaland|Ukonline|borg\-bot\/0\.9|boxseabot|bspider|calif|christcrawler|CMC\/0\.01|combine|confuzzledbot|CoolBot|cosmos|Internet Cruiser Robot|cusco|cyberspyder|cydralspider|desertrealm, desert realm|digger|DIIbot|grabber|downloadexpress|DragonBot|dwcp|ecollector|ebiness|elfinbot|esculapio|esther|fastcrawler|FDSE|FELIX IDE|ESI|fido|Hämähäkki|KIT\-Fireball|fouineur|Freecrawl|gammaSpider|gazz|gcreep|golem|googlebot|griffon|Gromit|gulliver|gulper|hambot|havIndex|hotwired|htdig|iajabot|INGRID\/0\.1|Informant|InfoSpiders|inspectorwww|irobot|Iron33|JBot|jcrawler|Teoma|Jeeves|jobo|image\.kapsi\.net|KDD\-Explorer|ko_yappo_robot|label\-grabber|larbin|legs|Linkidator|linkwalker|Lockon|logo_gif_crawler|marvin|mattie|mediafox|MerzScope|NEC\-MeshExplorer|MindCrawler|udmsearch|moget|Motor|msnbot|muncher|muninn|MuscatFerret|MwdSearch|sharp\-info\-agent|WebMechanic|NetScoop|newscan\-online|ObjectsSearch|Occam|Orbsearch\/1\.0|packrat|pageboy|ParaSite|patric|pegasus|perlcrawler|phpdig|piltdownman|Pimptrain|pjspider|PlumtreeWebAccessor|PortalBSpider|psbot|Getterrobo\-Plus|Raven|RHCS|RixBot|roadrunner|Robbie|robi|RoboCrawl|robofox|Scooter|Search\-AU|searchprocess|Senrigan|Shagseeker|sift|SimBot|Site Valet|skymob|SLCrawler\/2\.0|slurp|ESI|snooper|solbot|speedy|spider_monkey|SpiderBot\/1\.0|spiderline|nil|suke|http:\/\/www\.sygol\.com|tach_bw|TechBOT|templeton|titin|topiclink|UdmSearch|urlck|Valkyrie libwww\-perl|verticrawl|Victoria|void\-bot|Voyager|VWbot_K|crawlpaper|wapspider|WebBandit\/1\.0|webcatcher|T\-H\-U\-N\-D\-E\-R\-S\-T\-O\-N\-E|WebMoose|webquest|webreaper|webs|webspider|WebWalker|wget|winona|whowhere|wlm|WOLP|WWWC|none|XGET|Nederland\.zoek|AISearchBot|woriobot|NetSeer|Nutch|YandexBot|YandexMobileBot|SemrushBot|FatBot|MJ12bot|DotBot|AddThis|baiduspider|SeznamBot|mod_pagespeed|CCBot|openstat.ru\/Bot|m2e/i';
    // An empty user agent is treated as a bot as well.
    $userAgent = empty($_SERVER['HTTP_USER_AGENT']) ? FALSE : $_SERVER['HTTP_USER_AGENT'];
    $isBot = !$userAgent || preg_match($bot_regex, $userAgent);

    return $isBot;
}

Anyway, take care: some bots use a browser-like user agent to fake their identity (I get many Russian IPs showing this behaviour on my site).

One distinctive feature of most bots is that they don't carry any cookies, so no session is attached to them; a sketch of how to exploit this follows below. (I am not sure exactly how best to do it, but it is for sure a good way to track them.)
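
A sketch of that cookie idea (an editorial illustration of the heuristic, not the answer's code): set a cookie on the first response and treat clients that never send it back as suspect.

session_start();

if (!isset($_COOKIE['seen'])) {
    // First visit, or a client that does not carry cookies (most bots).
    setcookie('seen', '1', time() + 86400);
    $_SESSION['maybe_bot'] = true;
} else {
    // The cookie came back: a normal browser (or a bot that keeps cookies).
    $_SESSION['maybe_bot'] = false;
}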

WonderLand
8

Use the Device Detector open-source library; it offers an isBot() function: https://github.com/piwik/device-detector
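
Typical usage, following the project's README (installed via Composer as piwik/device-detector):

require_once 'vendor/autoload.php';

use DeviceDetector\DeviceDetector;

$dd = new DeviceDetector($_SERVER['HTTP_USER_AGENT']);
$dd->parse();

if ($dd->isBot()) {
    // getBot() returns details such as name, category and producer.
    $botInfo = $dd->getBot();
}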

mattab
7

You could analyse the user agent ($_SERVER['HTTP_USER_AGENT']) or compare the client’s IP address ($_SERVER['REMOTE_ADDR']) with a list of IP addresses of search engine bots.
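
A sketch of the IP-list approach (the CIDR ranges below are placeholders for illustration; real, current ranges have to come from the search engines themselves):

// Hypothetical example ranges; IPv4 only.
$botRanges = array('66.249.64.0/19', '157.55.0.0/16');

function ipInCidr($ip, $cidr) {
    list($subnet, $bits) = explode('/', $cidr);
    $mask = -1 << (32 - (int) $bits);
    return (ip2long($ip) & $mask) === (ip2long($subnet) & $mask);
}

$isKnownBotIp = false;
foreach ($botRanges as $range) {
    if (ipInCidr($_SERVER['REMOTE_ADDR'], $range)) {
        $isKnownBotIp = true;
        break;
    }
}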

Gumbo
  • An IP list is more secure if you want to make sure the user agent really is a search engine bot, because it is possible to create fake user agents by name. – Mojtaba Rezaeian Jul 01 '15 at 06:45
6

I made a good and fast function for this:

function is_bot() {
    if (isset($_SERVER['HTTP_USER_AGENT'])) {
        return preg_match('/rambler|abacho|acoi|accona|aspseek|altavista|estyle|scrubby|lycos|geona|ia_archiver|alexa|sogou|skype|facebook|twitter|pinterest|linkedin|naver|bing|google|yahoo|duckduckgo|yandex|baidu|teoma|xing|java\/1.7.0_45|bot|crawl|slurp|spider|mediapartners|\sask\s|\saol\s/i', $_SERVER['HTTP_USER_AGENT']);
    }

    return false;
}

This covers 99% of all common bots, search engines, etc.

Ivijan Stefan Stipić
4

A bot detector that has been working successfully on my website:

function isBotDetected() {

    if ( preg_match('/abacho|accona|AddThis|AdsBot|ahoy|AhrefsBot|AISearchBot|alexa|altavista|anthill|appie|applebot|arale|araneo|AraybOt|ariadne|arks|aspseek|ATN_Worldwide|Atomz|baiduspider|baidu|bbot|bingbot|bing|Bjaaland|BlackWidow|BotLink|bot|boxseabot|bspider|calif|CCBot|ChinaClaw|christcrawler|CMC\/0\.01|combine|confuzzledbot|contaxe|CoolBot|cosmos|crawler|crawlpaper|crawl|curl|cusco|cyberspyder|cydralspider|dataprovider|digger|DIIbot|DotBot|downloadexpress|DragonBot|DuckDuckBot|dwcp|EasouSpider|ebiness|ecollector|elfinbot|esculapio|ESI|esther|eStyle|Ezooms|facebookexternalhit|facebook|facebot|fastcrawler|FatBot|FDSE|FELIX IDE|fetch|fido|find|Firefly|fouineur|Freecrawl|froogle|gammaSpider|gazz|gcreep|geona|Getterrobo-Plus|get|girafabot|golem|googlebot|\-google|grabber|GrabNet|griffon|Gromit|gulliver|gulper|hambot|havIndex|hotwired|htdig|HTTrack|ia_archiver|iajabot|IDBot|Informant|InfoSeek|InfoSpiders|INGRID\/0\.1|inktomi|inspectorwww|Internet Cruiser Robot|irobot|Iron33|JBot|jcrawler|Jeeves|jobo|KDD\-Explorer|KIT\-Fireball|ko_yappo_robot|label\-grabber|larbin|legs|libwww-perl|linkedin|Linkidator|linkwalker|Lockon|logo_gif_crawler|Lycos|m2e|majesticsEO|marvin|mattie|mediafox|mediapartners|MerzScope|MindCrawler|MJ12bot|mod_pagespeed|moget|Motor|msnbot|muncher|muninn|MuscatFerret|MwdSearch|NationalDirectory|naverbot|NEC\-MeshExplorer|NetcraftSurveyAgent|NetScoop|NetSeer|newscan\-online|nil|none|Nutch|ObjectsSearch|Occam|openstat.ru\/Bot|packrat|pageboy|ParaSite|patric|pegasus|perlcrawler|phpdig|piltdownman|Pimptrain|pingdom|pinterest|pjspider|PlumtreeWebAccessor|PortalBSpider|psbot|rambler|Raven|RHCS|RixBot|roadrunner|Robbie|robi|RoboCrawl|robofox|Scooter|Scrubby|Search\-AU|searchprocess|search|SemrushBot|Senrigan|seznambot|Shagseeker|sharp\-info\-agent|sift|SimBot|Site Valet|SiteSucker|skymob|SLCrawler\/2\.0|slurp|snooper|solbot|speedy|spider_monkey|SpiderBot\/1\.0|spiderline|spider|suke|tach_bw|TechBOT|TechnoratiSnoop|templeton|teoma|titin|topiclink|twitterbot|twitter|UdmSearch|Ukonline|UnwindFetchor|URL_Spider_SQL|urlck|urlresolver|Valkyrie libwww\-perl|verticrawl|Victoria|void\-bot|Voyager|VWbot_K|wapspider|WebBandit\/1\.0|webcatcher|WebCopier|WebFindBot|WebLeacher|WebMechanic|WebMoose|webquest|webreaper|webspider|webs|WebWalker|WebZip|wget|whowhere|winona|wlm|WOLP|woriobot|WWWC|XGET|xing|yahoo|YandexBot|YandexMobileBot|yandex|yeti|Zeus/i', $_SERVER['HTTP_USER_AGENT'])
    ) {
        return true; // one of the bots listed above was detected
    }

    return false;

} // End :: isBotDetected()
Irshad Khan
4
<?php // IPCLOAK HOOK
if (CLOAKING_LEVEL != 4) {
    $lastupdated = date("Ymd", filemtime(FILE_BOTS));
    if ($lastupdated != date("Ymd")) {
        $lists = array(
        'http://labs.getyacg.com/spiders/google.txt',
        'http://labs.getyacg.com/spiders/inktomi.txt',
        'http://labs.getyacg.com/spiders/lycos.txt',
        'http://labs.getyacg.com/spiders/msn.txt',
        'http://labs.getyacg.com/spiders/altavista.txt',
        'http://labs.getyacg.com/spiders/askjeeves.txt',
        'http://labs.getyacg.com/spiders/wisenut.txt',
        );
        $opt = '';
        foreach ($lists as $list) {
            $opt .= fetch($list); // fetch() is YACG's helper for downloading the lists
        }
        $opt = preg_replace("/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/", "\n", $opt);
        $fp =  fopen(FILE_BOTS,"w");
        fwrite($fp,$opt);
        fclose($fp);
    }
    $ip = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '';
    $ref = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
    $agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    $host = strtolower(gethostbyaddr($ip));
    $file = implode(" ", file(FILE_BOTS));
    $exp = explode(".", $ip);
    $class = $exp[0].'.'.$exp[1].'.'.$exp[2].'.';
    $threshold = CLOAKING_LEVEL;
    $cloak = 0;
    // A host can match at most one of these, so the checks must be
    // OR'ed, not AND'ed as in the original script.
    if (stristr($host, "googlebot") || stristr($host, "inktomi") || stristr($host, "msn")) {
        $cloak++;
    }
    if (stristr($file, $class)) {
        $cloak++;
    }
    if (stristr($file, $agent)) {
        $cloak++;
    }
    if (strlen($ref) > 0) {
        $cloak = 0;
    }

    if ($cloak >= $threshold) {
        $cloakdirective = 1;
    } else {
        $cloakdirective = 0;
    }
}
?>

That would be the ideal way to cloak for spiders. It's from an open source script called YACG: http://getyacg.com

Needs a bit of work, but definitely the way to go.

L. Cosio
2

For Google I'm using this method:

function is_google() {
    $ip   = $_SERVER['REMOTE_ADDR'];
    $host = gethostbyaddr( $ip );

    // The hostname must END in .google.com or .googlebot.com; a plain
    // strpos() check would also accept e.g. "x.google.com.evil.com".
    if ( preg_match( '/\.google(bot)?\.com$/i', $host ) ) {

        // Forward-confirm the hostname to guard against spoofed reverse DNS.
        $forward_lookup = gethostbyname( $host );

        if ( $forward_lookup == $ip ) {
            return true;
        }

        return false;
    } else {
        return false;
    }
}

var_dump( is_google() );

Credits: https://support.google.com/webmasters/answer/80553
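
Since gethostbyaddr() can be slow, the per-IP result can be cached, as suggested in the comments on the DNS-verification answer above; a sketch assuming the APCu extension is available:

function is_google_cached() {
    $key = 'is_google_' . $_SERVER['REMOTE_ADDR'];

    $cached = apcu_fetch($key, $found);
    if ($found) {
        return $cached;
    }

    $result = is_google();            // the function above
    apcu_store($key, $result, 86400); // remember this IP for a day
    return $result;
}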

Mike Aron
1

I'm using this code, and it works pretty well. It makes it very easy to see which user agents have visited your site. The code opens a file and appends the user agent to it. You can check this file each day by going to yourdomain.com/useragent.txt, learn about new user agents, and add them to the condition of your if clause.

$user_agent = strtolower($_SERVER['HTTP_USER_AGENT']);
if (!preg_match("/Googlebot|MJ12bot|yandexbot/i", $user_agent)) {
    // if the conditions are not met, do what you need

    // append the user agent to useragent.txt; check that file each day,
    // learn about new user agents and add them to the condition above
    if ($user_agent != "") {
        $myfile = fopen("useragent.txt", "a") or die("Unable to open file useragent.txt!");
        fwrite($myfile, $user_agent . "\n");
        fclose($myfile);
    }
}

This is the content of useragent.txt

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; MJ12bot/v1.4.6; http://mj12bot.com/)
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (iphone; cpu iphone os 9_3 like mac os x) applewebkit/601.1.46 (khtml, like gecko) version/9.0 mobile/13e198 safari/601.1
mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/53.0.2785.143 safari/537.36
mozilla/5.0 (compatible; linkdexbot/2.2; +http://www.linkdex.com/bots/)
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/53.0.2785.143 safari/537.36
mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/53.0.2785.143 safari/537.36
mozilla/5.0 (compatible; baiduspider/2.0; +http://www.baidu.com/search/spider.html)
zoombot (linkbot 1.0 http://suite.seozoom.it/bot.html)
mozilla/5.0 (windows nt 10.0; wow64) applewebkit/537.36 (khtml, like gecko) chrome/44.0.2403.155 safari/537.36 opr/31.0.1889.174
mozilla/5.0 (windows nt 10.0; wow64) applewebkit/537.36 (khtml, like gecko) chrome/44.0.2403.155 safari/537.36 opr/31.0.1889.174
sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)
mozilla/5.0 (windows nt 10.0; wow64) applewebkit/537.36 (khtml, like gecko) chrome/44.0.2403.155 safari/537.36 opr/31.0.1889.174
  • What would your if-clause string piece be for this one? mozilla/5.0 (iphone; cpu iphone os 9_3 like mac os x) applewebkit/601.1.46 (khtml, like gecko) version/9.0 mobile/13e198 safari/601.1 – Average Joe Jan 24 '19 at 15:17
0

Verifying Googlebot

As the user agent can be changed...

The only officially supported way to identify a Google bot is to run a reverse DNS lookup on the accessing IP address, run a forward DNS lookup on the result to verify that it points back to the accessing IP address, and confirm that the resulting domain name is in either the googlebot.com or google.com domain.

Taken from here. So you must run both DNS lookups, reverse and forward. See this guide on Google Search Central.

yanntinoco
0

Here is what I use:

function is_bot() {
    if (preg_match('/bot|crawl|spider|mediapartners|slurp|patrol/i', $_SERVER['HTTP_USER_AGENT'])) {
        return true;
    }
    if (strpos($_SERVER['HTTP_USER_AGENT'], 'Headless') !== false) {
        return true;
    }
    return false;
}
tomcaptain
-1
function bot_detected() {

  if (preg_match('/bot|crawl|slurp|spider|mediapartners/i', $_SERVER['HTTP_USER_AGENT'])) {
    return true;
  }
  else {
    return false;
  }
}
Elyor
-1

Might be late, but what about a hidden link? All well-behaved bots will follow a link with the rel="follow" attribute; only bad bots will also follow one marked rel="nofollow".

<a style="display:none;" rel="follow" href="javascript:void(0);" onclick="isabot();">.</a>

<script>
function isabot(){
    // define a variable to pass to PHP with Ajax,
    // or send the bot info directly wherever you need it.
    var isabot = true;
}
</script>

For a bad bot you can use this:

<a style="display:none;" href="javascript:void(0);" rel="nofollow" onclick="isBadbot();">.</a>

For a PHP-specific approach you can remove the onclick attribute and point the href attribute at your IP detector / bot detector, like so:

<a style="display:none;" rel="follow" href="https://somedomain.com/botdetector.php">.</a>

OR

<a style="display:none;" rel="nofollow" href="https://somedomain.com/badbotdetector.php">.</a>

You can work with this and maybe use both: one link detects a bot, while the other proves it to be a bad bot. A sketch of such a detector endpoint follows.
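
The botdetector.php endpoint is left open above; a minimal sketch of what it might do (hypothetical, not part of the original answer) is to log whoever follows the hidden link:

<?php
// botdetector.php (hypothetical): anything requesting this hidden
// link gets its IP and user agent logged for later review.
$line = sprintf(
    "%s\t%s\t%s\n",
    date('c'),
    $_SERVER['REMOTE_ADDR'],
    isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '-'
);
file_put_contents(__DIR__ . '/bot-hits.log', $line, FILE_APPEND | LOCK_EX);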

Hope you find this useful.