63

I am wondering how I would go about detecting search crawlers. The reason I ask is that I want to suppress certain JavaScript calls if the user agent is a bot.

I have found an example of how to detect a certain browser, but I am unable to find examples of how to detect a search crawler:

/MSIE (\d+\.\d+);/.test(navigator.userAgent); //test for MSIE x.x

Example of search crawlers I want to block:

Google 
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) 
Googlebot/2.1 (+http://www.googlebot.com/bot.html) 
Googlebot/2.1 (+http://www.google.com/bot.html) 

Baidu 
Baiduspider+(+http://www.baidu.com/search/spider_jp.html) 
Baiduspider+(+http://www.baidu.com/search/spider.htm) 
BaiDuSpider 
Jon
    Do you just want robots to not crawl your site? Use a `robots.txt` file. Anything that will play nice enough to tell you it's a bot will probably respect `robots.txt`. – user2357112 Nov 19 '13 at 23:38
    I want the robot to crawl my site. I just want to suppress certain JavaScript calls if it is a robot. – Jon Nov 19 '13 at 23:39
    Why bother? I doubt they'll even run your Javascript, and if they do, it'll be heavily sandboxed in ways that will probably prevent it from affecting anything you care about. – user2357112 Nov 19 '13 at 23:46
  • As @user2357112 stated: you can't detect the bots, as they never run the JavaScript (at least they don't do what you think they'll do). Most probably, you want to block them from running *visible* Ajax requests. I mean, you have code like «do a call on /ajax.html», and the bots call /ajax.html directly. Your only way to cope with such behaviour is to encode the URLs in your JavaScript (obfuscate them and the like). But whatever it is, you're doing something wrong IMHO. You may be red-flagged on your SEO, as you shouldn't serve things differently to bots and humans. – Yvan Dec 26 '13 at 22:15
    Recently, Googlebot has indeed begun executing Javascript, with some limitations. – troelskn May 13 '14 at 13:21
    @Jon echoed something I was recently wondering about myself. I want to redirect the user to an Angular.js backed interface/page if it is possible to deduce from the user string whether the visitor is a bot or an actual browser. If it is a bot, then I want the conventional web pages to be crawled. Otherwise, redirect to a page that the user needs to see first before visiting the conventional pages. Since bots may be capable of executing JavaScript (to whatever degree), I prefer the bot does not even encounter a redirect to the Angular.js page. – Web User Aug 22 '15 at 10:46
  • Take a look at this library: https://www.npmjs.com/package/isbot – Kamran Taghaddos Jul 02 '22 at 12:41

9 Answers

62

This is the regex the Ruby user-agent library agent_orange uses to test whether a user agent looks like a bot. You can narrow it down for specific bots by referencing the bot user-agent list here:

/bot|crawler|spider|crawling/i

For example, if you have some object, util.browser, you can store what type of device a user is on:

util.browser = {
   bot: /bot|googlebot|crawler|spider|robot|crawling/i.test(navigator.userAgent),
   mobile: ...,
   desktop: ...
}
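
For instance, a minimal sketch of how that flag could gate the calls the question wants to suppress (loadAnalytics is a hypothetical stand-in, not part of this answer):

if (!util.browser.bot) {
   loadAnalytics(); // hypothetical call that should only run for human visitors
}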
megawac
    Cool, thank you. I am curious about my requirements for Google. On my second line, I am to block out `Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)`. I am wondering what that means? Shouldn't Mozilla be one of the Regexp I should be including in my code? – Jon Nov 19 '13 at 23:58
  • @icu222much see http://stackoverflow.com/questions/5125438/why-do-chrome-and-ie-put-mozilla-5-0-in-the-user-agent-they-send-to-the-server You should just match if the string contains bot/spider/etc to check if a ua is a bot – megawac Nov 20 '13 at 00:00
  • I tried `if (/YahooSeeker|/.test(navigator.userAgent)) {console.log('yahoo')}` and I left my UA as default (Mozilla) but the `if` statement returned true. Am I doing something incorrectly? – Jon Nov 20 '13 at 17:38
    you have an extraneous `|` (or statement) in your regex so that test will always pass. Try `/YahooSeeker/` – megawac Nov 20 '13 at 17:39
  • I have removed the extra pipe so my statement now says `if (/Googlebot/.test(navigator.userAgent)) {...}` but is now reporting false even when I am using Googlebot as my UA. – Jon Nov 20 '13 at 17:51
  • The googlebot ua is `Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)` so try `/Googlebot/i.test("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")`. You were missing the `i` flag – megawac Nov 20 '13 at 17:53
  • Sorry, I don't mean to sound noob-ish, but it is still not working. I have `if ( /Googlebot/i.test("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)") )` which is always returning true even though I have disabled my UA. Can we move this into a chat? – Jon Nov 20 '13 at 18:07
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/41559/discussion-between-megawac-and-icu222much) – megawac Nov 20 '13 at 18:10
    `googlebot` and `robot` are redundant in the regex string used since `bot` will match first. `/bot|crawler|spider|crawling/i` would be much simpler. – tiernanx Jul 29 '16 at 20:03
    Now that navigator.userAgent is deprecated what would be the preferred way to do it on javascript. – Hariom Balhara Feb 22 '17 at 06:29
    You can simplify it even further by combining `crawler` and `crawling` into `crawl`: `/bot|crawl|spider/i` – tzazo Jun 22 '20 at 16:17
37

Try this. It's based on the crawler list available at https://github.com/monperrus/crawler-user-agents:

var botPattern = "(googlebot\/|bot|Googlebot-Mobile|Googlebot-Image|Google favicon|Mediapartners-Google|bingbot|slurp|java|wget|curl|Commons-HttpClient|Python-urllib|libwww|httpunit|nutch|phpcrawl|msnbot|jyxobot|FAST-WebCrawler|FAST Enterprise Crawler|biglotron|teoma|convera|seekbot|gigablast|exabot|ngbot|ia_archiver|GingerCrawler|webmon |httrack|webcrawler|grub.org|UsineNouvelleCrawler|antibot|netresearchserver|speedy|fluffy|bibnum.bnf|findlink|msrbot|panscient|yacybot|AISearchBot|IOI|ips-agent|tagoobot|MJ12bot|dotbot|woriobot|yanga|buzzbot|mlbot|yandexbot|purebot|Linguee Bot|Voyager|CyberPatrol|voilabot|baiduspider|citeseerxbot|spbot|twengabot|postrank|turnitinbot|scribdbot|page2rss|sitebot|linkdex|Adidxbot|blekkobot|ezooms|dotbot|Mail.RU_Bot|discobot|heritrix|findthatfile|europarchive.org|NerdByNature.Bot|sistrix crawler|ahrefsbot|Aboundex|domaincrawler|wbsearchbot|summify|ccbot|edisterbot|seznambot|ec2linkfinder|gslfbot|aihitbot|intelium_bot|facebookexternalhit|yeti|RetrevoPageAnalyzer|lb-spider|sogou|lssbot|careerbot|wotbox|wocbot|ichiro|DuckDuckBot|lssrocketcrawler|drupact|webcompanycrawler|acoonbot|openindexspider|gnam gnam spider|web-archive-net.com.bot|backlinkcrawler|coccoc|integromedb|content crawler spider|toplistbot|seokicks-robot|it2media-domain-crawler|ip-web-crawler.com|siteexplorer.info|elisabot|proximic|changedetection|blexbot|arabot|WeSEE:Search|niki-bot|CrystalSemanticsBot|rogerbot|360Spider|psbot|InterfaxScanBot|Lipperhey SEO Service|CC Metadata Scaper|g00g1e.net|GrapeshotCrawler|urlappendbot|brainobot|fr-crawler|binlar|SimpleCrawler|Livelapbot|Twitterbot|cXensebot|smtbot|bnf.fr_bot|A6-Indexer|ADmantX|Facebot|Twitterbot|OrangeBot|memorybot|AdvBot|MegaIndex|SemanticScholarBot|ltx71|nerdybot|xovibot|BUbiNG|Qwantify|archive.org_bot|Applebot|TweetmemeBot|crawler4j|findxbot|SemrushBot|yoozBot|lipperhey|y!j-asr|Domain Re-Animator Bot|AddThis)";
var re = new RegExp(botPattern, 'i');
var userAgent = navigator.userAgent; 
if (re.test(userAgent)) {
    console.log('the user agent is a crawler!');
}
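
As a small sketch, the compiled pattern can be reused to gate calls elsewhere (the helper name isCrawlerUA is illustrative, not from the original answer):

function isCrawlerUA(ua) {
    // Treat a missing user agent as "not a crawler".
    return re.test(ua || '');
}

if (!isCrawlerUA(navigator.userAgent)) {
    // run the JavaScript you want to hide from crawlers
}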
Sergey P. aka azure
20

The following regex will match the biggest search engines according to this post.

/bot|google|baidu|bing|msn|teoma|slurp|yandex/i
    .test(navigator.userAgent)

The matched search engines are:

  • Baidu
  • Bingbot/MSN
  • DuckDuckGo (duckduckbot)
  • Google
  • Teoma
  • Yahoo!
  • Yandex

Additionally, I've added bot as a catchall for smaller crawlers/bots.
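
As a hedged sketch, the result can be stored once and used to guard the calls in question:

var isBot = /bot|google|baidu|bing|msn|teoma|slurp|yandex/i.test(navigator.userAgent);

if (!isBot) {
    // JavaScript calls that should only run for human visitors
}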

Edo
    **`aolbuild` is not a bot**. We removed it from our regex today because multiple customers called and complained about being flagged as a bot. perishablepress.com is incorrect about `aolbuild`. – rocky May 31 '17 at 23:32
  • Thanks @rocky, I've removed aolbuild from the answer – Edo Jun 01 '17 at 10:32
  • There are also Facebook crawler bots: facebookexternalhit|facebot https://developers.facebook.com/docs/sharing/webmasters/crawler – Amir Bar Aug 17 '17 at 09:52
  • duckduckgo should be: duckduckbot (see: https://duckduckgo.com/duckduckbot) – dave Apr 23 '18 at 23:23
  • Thanks @dave, edited. Funnily enough, perishablepress.com lists the correct user agent string, but the regex they suggest is wrong. – Edo Apr 24 '18 at 07:52
    duckduckbot is made redundant by "bot": `/bot|google|baidu|bing|msn|teoma|slurp|yandex/i` – Omri May 27 '20 at 09:52
8

This might help detect robot user agents while also keeping things more organized:

JavaScript

const detectRobot = (userAgent) => {
  const robots = new RegExp([
    /bot/,/spider/,/crawl/,                            // GENERAL TERMS
    /APIs-Google/,/AdsBot/,/Googlebot/,                // GOOGLE ROBOTS
    /mediapartners/,/Google Favicon/,
    /FeedFetcher/,/Google-Read-Aloud/,
    /DuplexWeb-Google/,/googleweblight/,
    /bing/,/yandex/,/baidu/,/duckduck/,/yahoo/,        // OTHER ENGINES
    /ecosia/,/ia_archiver/,
    /facebook/,/instagram/,/pinterest/,/reddit/,       // SOCIAL MEDIA
    /slack/,/twitter/,/whatsapp/,/youtube/,
    /semrush/,                                         // OTHER
  ].map((r) => r.source).join("|"),"i");               // BUILD REGEXP + "i" FLAG

  return robots.test(userAgent);
};

TypeScript

const detectRobot = (userAgent: string): boolean => {
  const robots = new RegExp(([
    /bot/,/spider/,/crawl/,                               // GENERAL TERMS
    /APIs-Google/,/AdsBot/,/Googlebot/,                   // GOOGLE ROBOTS
    /mediapartners/,/Google Favicon/,
    /FeedFetcher/,/Google-Read-Aloud/,
    /DuplexWeb-Google/,/googleweblight/,
    /bing/,/yandex/,/baidu/,/duckduck/,/yahoo/,           // OTHER ENGINES
    /ecosia/,/ia_archiver/,
    /facebook/,/instagram/,/pinterest/,/reddit/,          // SOCIAL MEDIA
    /slack/,/twitter/,/whatsapp/,/youtube/,
    /semrush/,                                            // OTHER
  ] as RegExp[]).map((r) => r.source).join("|"),"i");     // BUILD REGEXP + "i" FLAG

  return robots.test(userAgent);
};

Use on server:

const userAgent = req.get('user-agent');
const isRobot = detectRobot(userAgent);

Use on "client" / some phantom browser a bot might be using:

const userAgent = navigator.userAgent;
const isRobot = detectRobot(userAgent);
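
As a sketch, detectRobot could also be wired up as Express-style middleware so later handlers can branch on the result (the middleware shape assumes an Express app and is an illustration, not part of the answer):

app.use((req, res, next) => {
  // Flag the request once; an empty string covers a missing header.
  res.locals.isRobot = detectRobot(req.get('user-agent') || '');
  next();
});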

Overview of Google crawlers:

https://developers.google.com/search/docs/advanced/crawling/overview-google-crawlers

cbdeveloper
2

The Event.isTrusted property could help you.

The isTrusted read-only property of the Event interface is a Boolean that is true when the event was generated by a user action, and false when the event was created or modified by a script or dispatched via EventTarget.dispatchEvent().

For example:

function isCrawler(event) {
  // Scripted/synthetic events have isTrusted === false,
  // so negate it to flag events that a user did not generate.
  return !event.isTrusted;
}
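
A minimal sketch of wiring that check into a real event handler (the listener is illustrative):

document.addEventListener('click', (event) => {
  if (event.isTrusted) {
    // Genuine user gesture; safe to run human-only code here.
  }
});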

⚠ Note that IE doesn't support this property.

Read more from doc: https://developer.mozilla.org/en-US/docs/Web/API/Event/isTrusted

Emeric
2

People might like to check out the new navigator.webdriver property, which allows bots to inform you that they are bots:

https://developer.mozilla.org/en-US/docs/Web/API/Navigator/webdriver

The webdriver read-only property of the Navigator interface indicates whether the user agent is controlled by automation.

It defines a standard way for co-operating user agents to inform the document that it is controlled by WebDriver, for example, so that alternate code paths can be triggered during automation.

It is supported by all major browsers and respected by major browser automation software like Puppeteer. Users of automation software can of course disable it, and so it should only be used to detect "good" bots.
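
For example, a minimal sketch of branching on it (an undefined property is treated as "not automated"):

if (navigator.webdriver) {
    // A co-operating automation tool is driving the browser.
} else {
    // Likely a human-driven browser; run the normal code path.
}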

joe
1

I combined some of the above and removed some redundancy. I use this in .htaccess on a semi-private site:

(google|bot|crawl|spider|slurp|baidu|bing|msn|teoma|yandex|java|wget|curl|Commons-HttpClient|Python-urllib|libwww|httpunit|nutch|biglotron|convera|gigablast|archive|webmon|httrack|grub|netresearchserver|speedy|fluffy|bibnum|findlink|panscient|IOI|ips-agent|yanga|Voyager|CyberPatrol|postrank|page2rss|linkdex|ezooms|heritrix|findthatfile|Aboundex|summify|ec2linkfinder|facebook|slack|instagram|pinterest|reddit|twitter|whatsapp|yeti|RetrevoPageAnalyzer|sogou|wotbox|ichiro|drupact|coccoc|integromedb|siteexplorer|proximic|changedetection|WeSEE|scrape|scaper|g00g1e|binlar|indexer|MegaIndex|ltx71|BUbiNG|Qwantify|lipperhey|y!j-asr|AddThis)
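
Since this thread is mostly about client-side detection, the same combined pattern can also be reused in JavaScript; a minimal sketch (the pattern is abbreviated here, use the full one above):

var combined = /google|bot|crawl|spider|slurp|baidu|bing|msn|teoma|yandex/i;
if (!combined.test(navigator.userAgent)) {
    // human-only code path
}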

Wes Reimer
0

The "test for MSIE x.x" example is just code for testing the userAgent against a Regular Expression. In your example the Regexp is the

/MSIE (\d+\.\d+);/

part. Just replace it with your own regular expression that you want to test the user agent against. It would be something like

/Google|Baidu|Baiduspider/.test(navigator.userAgent)

where the vertical bar is the "or" operator, matching the user agent against all of the robots you mentioned. For more information about regular expressions you can refer to this site, since JavaScript uses Perl-style regular expressions.
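
Putting that together, a minimal sketch using the bots listed in the question:

if (/Googlebot|Baiduspider/i.test(navigator.userAgent)) {
    // skip the JavaScript calls you want to suppress for crawlers
}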

morten.c
  • Cool, thank you. I am curious about my requirements for Google. On my second line, I am to block out `Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)`. I am wondering what that means? Shouldn't Mozilla be one of the Regexp I should be including in my code? – Jon Nov 19 '13 at 23:58
  • I thought you just don't know how to match the user agent against you're list, so stick to the answer/comment of megawac, I don't have much expirience identifying bots/crawler. So +1 for his answer. – morten.c Nov 20 '13 at 01:09
  • I tried `if (/YahooSeeker|/.test(navigator.userAgent)) {console.log('yahoo')}` and I left my user-agent as default (Mozilla) but the `if` statement returned true. Am I doing something incorrectly? – Jon Nov 20 '13 at 17:36
  • There is again a pipe too much at the end of your RegEx, change it to "/YahooSeeker/" should solve this issue. – morten.c Dec 03 '13 at 00:11
0

I found the isbot package, which has a built-in isbot() function. It seems to me that the package is properly maintained and kept up to date.

USAGE:

const isBot = require('isbot');

...

isBot(req.get('user-agent'));

Package: https://www.npmjs.com/package/isbot

NeNaD