57

I am looking to roll my own simple web stats script.

The only major obstacle on the road, as far as I can see, is telling human visitors apart from bots. I would like to have a solution for that which I don't need to maintain on a regular basis (i.e. I don't want to update text files with bot-related User-agents).

Is there any open service that does that, like Akismet does for spam? Or is there a PHP project that is dedicated to recognizing spiders and bots and provides frequent updates?

To clarify: I'm not looking to block bots. I do not need 100% watertight results. I just want to exclude as many as I can from my stats. I know that parsing the User-Agent header is an option, but maintaining the patterns to parse for is a lot of work. My question is whether there is any project or service that does that already.

Bounty: I thought I'd push this as a reference question on the topic. The best / most original / most technically viable contribution will receive the bounty amount.

Pekka
  • 442,112
  • 142
  • 972
  • 1,088
  • Can I ask you why do you want to make your own? It can add a great amount of extra stress to your servers (write ops). What is it that GA can't track for you? – gblazex Dec 17 '10 at 16:59
  • @galambalazs I don't want to use GA in this scenario. The goal is to have a completely self-contained solution. – Pekka Dec 17 '10 at 17:00
  • I understand **what** your goal is. I'm just curious about **why**? :) – gblazex Dec 17 '10 at 17:56
  • 1
    @galambalasz the site I want to do this for is for a group of people who are not very technical minded. GA with its thousands of bells and whistles is too complicated for them. What they need to know is 1) the total number of visitors of the day and 2) a list of where approximately they come from. I think there's a demand for such simple solutions that GA is not addressing simply because it's so *complex*. However, with the [GA API](http://stackoverflow.com/questions/2374032), it's now possible to fetch and display data in a custom way – Pekka Dec 17 '10 at 18:03
  • 1
    so the argument is not as valid anymore as it was when I asked the question. But even apart from that, I sometimes have the desire to reduce dependency from 3rd party providers, especially for projects that will not undergo frequent technical development and maintenance. There are things that can go wrong with a hosted service - [technical outages](http://stackoverflow.com/questions/4471568/why-is-jquery-tools-cdn-link-pointing-to-an-ad), possible license changes, bankruptcy... It's all been there, even for the biggest and most mighty of companies – Pekka Dec 17 '10 at 18:05
  • Do you want to gather stats via some way of hooking into the page-view (JS a la Google Analytics, or invisible 1x1 pixel hit-counter png) or by processing your server logs offline after the fact? Or either/both? – Day Dec 20 '10 at 18:11

15 Answers

72

Humans and bots will do similar things, but bots will do things that humans don't. Let's try to identify those things. Before we look at behavior, let's accept RayQuang's comment as being useful. If a visitor has a bot's user-agent string, it's probably a bot. I can't imagine anybody going around with "Google Crawler" (or something similar) as a UA unless they're working on breaking something. I know you don't want to update a list manually, but auto-pulling that one should be good, and even if it stays stale for the next 10 years, it will be helpful.

Some have already mentioned Javascript and image loading, but Google will do both. We must assume there are now several bots that will do both, so those are no longer human indicators. What bots will still uniquely do, however, is follow an "invisible" link. Link to a page in a way that is completely invisible to a human user. If that link gets followed, we've got a bot.

Bots will often, though not always, respect robots.txt. Users don't care about robots.txt, and we can probably assume that anybody retrieving robots.txt is a bot. We can go one step further, though, and link a dummy CSS page to our pages that is excluded by robots.txt. If our normal CSS is loaded but our dummy CSS isn't, it's definitely a bot. You'll have to build a table (probably in-memory) of loads by IP and do a not-contained-in match, but that should be a really solid tell.

So, to use all this: maintain a database table of bots by IP address, possibly with timestamp limitations. Add anything that follows your invisible link, and add anything that loads the "real" CSS but ignores the robots.txt CSS. Maybe add all the robots.txt downloaders as well. Filter the user-agent string as the last step, and consider using it for a quick stats analysis to see how well those methods are working at identifying things we know are bots.
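A rough sketch of how those traps could be wired together in PHP follows; the file name, the query parameter and the SQLite storage are placeholders picked for illustration, not a finished design.

<?php
// trap.php - implements the two traps described above (names are illustrative).
//
//   trap.php?t=css    linked as a stylesheet from every page but disallowed in
//                     robots.txt: browsers load it, polite bots skip it.
//   trap.php?t=link   target of an "invisible" link no human can see:
//                     anything that follows it is almost certainly a bot.

$db = new PDO('sqlite:' . __DIR__ . '/visitors.sqlite');
$db->exec('CREATE TABLE IF NOT EXISTS hits (ip TEXT, kind TEXT, seen INTEGER)');

$kind = (($_GET['t'] ?? '') === 'link') ? 'followed_invisible_link' : 'loaded_trap_css';
$stmt = $db->prepare('INSERT INTO hits (ip, kind, seen) VALUES (?, ?, ?)');
$stmt->execute([$_SERVER['REMOTE_ADDR'], $kind, time()]);

if ($kind === 'loaded_trap_css') {
    header('Content-Type: text/css');    // serve an empty stylesheet
} else {
    http_response_code(204);             // nothing to show on the hidden page
}

// When building the stats, treat an IP as a bot if it followed the invisible
// link, or if it loaded the real CSS but never this trap CSS.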

Jeff Ferland
  • 17,832
  • 7
  • 46
  • 76
  • 1
    I don't like the `save IP address` part – Daniel W. Oct 02 '13 at 14:52
  • 4
    @JeffFerland in times of massive NSA spying, we need trusted sites that don't save the IP at all – Daniel W. Oct 04 '13 at 07:33
  • About 2% of the population are blind. Among older people the percentage rises to about 6%. Blind people often surf the web using browsers that do not load images, style sheets, or JavaScript. If your site is in any way interesting to blind people, please do not forget them when you analyse your logfiles. Thank you. –  Dec 22 '18 at 12:00
21

The easiest way is to check if their user agent includes 'bot' or 'spider'. Most do.
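For example, something as small as this (a sketch, nothing more than a case-insensitive substring match):

<?php
// Crude but effective: most well-behaved crawlers put "bot", "spider" or
// "crawl" somewhere in their user-agent string.
function looks_like_bot($user_agent) {
    return (bool) preg_match('/bot|spider|crawl/i', $user_agent);
}

if (!looks_like_bot($_SERVER['HTTP_USER_AGENT'] ?? '')) {
    // count this hit as human
}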

Yacoby
  • 54,544
  • 15
  • 116
  • 120
  • Hmm. Could it be that easy? But then, there are user agents like wget or getleft that would be nice to recognize as well. Still - +1 – Pekka Nov 11 '09 at 18:20
  • 4
    The legitimate ones do. The bad ones (e.g., email harvesters) will just hijack a useragent string from a web browser. – Bob Kaufman Nov 11 '09 at 18:22
  • 1
    And the ones that don't probably don't want you to know they are bots anyway. – Svish Nov 11 '09 at 18:27
13

EDIT (10 years later): As Lukas said in the comments, almost all crawlers today support JavaScript, so I've removed the paragraph that stated that if the site was JS-based, most bots would be automatically stripped out.

You can follow a bot list and add their user-agent to the filtering list.

Take a look at this bot list.

This user-agent list is also pretty good. Just strip out all the B's and you're set.

EDIT: eSniff has done amazing work and provides the above list "in a form that can be queried and parsed more easily: robotstxt.org/db/all.txt. Each new bot is defined by a robot-id:XXX. You should be able to download it once a week and parse it into something your script can use", as you can read in his comment below.
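A weekly cron job along these lines could keep the list fresh without manual maintenance (this assumes the file keeps its current "robot-useragent:" field layout; adjust if it changes):

<?php
// fetch-bot-list.php - run weekly from cron. Pulls the robotstxt.org database
// and extracts the user-agent strings into a PHP array the stats script can
// include.
$raw = file_get_contents('http://www.robotstxt.org/db/all.txt');

$agents = [];
foreach (explode("\n", $raw) as $line) {
    if (preg_match('/^robot-useragent:\s*(.+)$/i', $line, $m)) {
        $agent = trim($m[1]);
        if ($agent !== '' && strtolower($agent) !== 'none') {
            $agents[] = $agent;
        }
    }
}

file_put_contents(
    __DIR__ . '/bot-agents.php',
    '<?php return ' . var_export(array_values(array_unique($agents)), true) . ';'
);

// In the stats script:
//   $bots  = require __DIR__ . '/bot-agents.php';
//   $ua    = $_SERVER['HTTP_USER_AGENT'] ?? '';
//   $isBot = false;
//   foreach ($bots as $b) { if (stripos($ua, $b) !== false) { $isBot = true; break; } }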

Hope it helps!

Frankie
  • 24,627
  • 10
  • 79
  • 121
  • Depending on the market you are aiming at, neither do a lot of users. A lot of firefox users tend to use NoScript. – Yacoby Nov 11 '09 at 18:25
  • The bot lists look good. Maybe a combined JS / botlist solution, with a frequent list update, is the way to go. Cheers! – Pekka Nov 11 '09 at 18:33
  • 15
    NoScript also means, no StackOverflow, no Gmail, Reader, Maps, Facebook, YouTube and so on... I use NoScript all the time to check my own sites for spiders and bots, but nowadays doesn't make much sense to use NoScript. Just my opinion. – Frankie Nov 11 '09 at 18:34
  • 5
    @Col. It's just like Jeff puts it, always trying to suck a bit less... re-read it yesterday and though the comma would make it easier to read! :) – Frankie Jun 20 '10 at 16:59
  • 1
    BTW, here is the above list Robotstxt but in a form that can be queried and parsed easier. http://www.robotstxt.org/db/all.txt Each new Bot is defined by a robot-id:XXX. You should be able to download it once a week and parse it into something your script can use. – eSniff Dec 22 '10 at 19:53
  • 1
    This answer is definitely outdated. Now more and more bots are using something like headless chrome which will execute everything just like chrome does when human use it. It was launched in mid of 2017. Also, Firefox can run in headless mode and probably other browsers too already or will run in the future. JS is not an issue. Too many sites depend on JS just to render anything. Bots know that. – Lukas Liesis Jan 30 '19 at 11:50
  • @Lukas, definitely! 10 year's a long time... gonna edit to point to your comment. – Frankie Jan 31 '19 at 17:25
11

Consider a PHP stats script which is camouflaged as a CSS background image (send the right response headers, at least the content type and cache control, but write an empty image out).

Some bots parse JS, but certainly none of them load CSS images. One pitfall, as with JS, is that you will exclude text-based browsers with this, but that's less than 1% of the world wide web population. Also, there are certainly fewer CSS-disabled clients than JS-disabled clients (mobiles!).

To make it more solid for the (not unlikely) case that the more advanced bots (Google, Yahoo, etc.) may crawl the image in the future, disallow the path to the CSS image in robots.txt (which the better bots will respect anyway).
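A sketch of what such a script could look like (the file name and log destination are placeholders):

<?php
// stats-pixel.php - referenced from the stylesheet, e.g.
//   body { background-image: url(/stats-pixel.php); }
// and disallowed in robots.txt. Bots rarely fetch CSS images, so anything
// arriving here is counted as a (probable) human page view.

header('Content-Type: image/gif');
header('Cache-Control: no-store, no-cache, must-revalidate, max-age=0');
header('Pragma: no-cache');

// Log the hit; a flat file keeps the example short, a database would do too.
$line = sprintf("%s\t%s\t%s\n",
    date('c'),
    $_SERVER['REMOTE_ADDR'],
    $_SERVER['HTTP_REFERER'] ?? '-'
);
file_put_contents(__DIR__ . '/human-hits.log', $line, FILE_APPEND | LOCK_EX);

// A transparent 1x1 GIF as the "empty" image body.
echo base64_decode('R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7');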

BalusC
  • 1,082,665
  • 372
  • 3,610
  • 3,555
9

I use the following for my stats/counter app:

<?php
    function is_bot($user_agent) {
        return preg_match('/(abot|dbot|ebot|hbot|kbot|lbot|mbot|nbot|obot|pbot|rbot|sbot|tbot|vbot|ybot|zbot|bot\.|bot\/|_bot|\.bot|\/bot|\-bot|\:bot|\(bot|crawl|slurp|spider|seek|accoona|acoon|adressendeutschland|ah\-ha\.com|ahoy|altavista|ananzi|anthill|appie|arachnophilia|arale|araneo|aranha|architext|aretha|arks|asterias|atlocal|atn|atomz|augurfind|backrub|bannana_bot|baypup|bdfetch|big brother|biglotron|bjaaland|blackwidow|blaiz|blog|blo\.|bloodhound|boitho|booch|bradley|butterfly|calif|cassandra|ccubee|cfetch|charlotte|churl|cienciaficcion|cmc|collective|comagent|combine|computingsite|csci|curl|cusco|daumoa|deepindex|delorie|depspid|deweb|die blinde kuh|digger|ditto|dmoz|docomo|download express|dtaagent|dwcp|ebiness|ebingbong|e\-collector|ejupiter|emacs\-w3 search engine|esther|evliya celebi|ezresult|falcon|felix ide|ferret|fetchrover|fido|findlinks|fireball|fish search|fouineur|funnelweb|gazz|gcreep|genieknows|getterroboplus|geturl|glx|goforit|golem|grabber|grapnel|gralon|griffon|gromit|grub|gulliver|hamahakki|harvest|havindex|helix|heritrix|hku www octopus|homerweb|htdig|html index|html_analyzer|htmlgobble|hubater|hyper\-decontextualizer|ia_archiver|ibm_planetwide|ichiro|iconsurf|iltrovatore|image\.kapsi\.net|imagelock|incywincy|indexer|infobee|informant|ingrid|inktomisearch\.com|inspector web|intelliagent|internet shinchakubin|ip3000|iron33|israeli\-search|ivia|jack|jakarta|javabee|jetbot|jumpstation|katipo|kdd\-explorer|kilroy|knowledge|kototoi|kretrieve|labelgrabber|lachesis|larbin|legs|libwww|linkalarm|link validator|linkscan|lockon|lwp|lycos|magpie|mantraagent|mapoftheinternet|marvin\/|mattie|mediafox|mediapartners|mercator|merzscope|microsoft url control|minirank|miva|mj12|mnogosearch|moget|monster|moose|motor|multitext|muncher|muscatferret|mwd\.search|myweb|najdi|nameprotect|nationaldirectory|nazilla|ncsa beta|nec\-meshexplorer|nederland\.zoek|netcarta webmap engine|netmechanic|netresearchserver|netscoop|newscan\-online|nhse|nokia6682\/|nomad|noyona|nutch|nzexplorer|objectssearch|occam|omni|open text|openfind|openintelligencedata|orb search|osis\-project|pack rat|pageboy|pagebull|page_verifier|panscient|parasite|partnersite|patric|pear\.|pegasus|peregrinator|pgp key agent|phantom|phpdig|picosearch|piltdownman|pimptrain|pinpoint|pioneer|piranha|plumtreewebaccessor|pogodak|poirot|pompos|poppelsdorf|poppi|popular iconoclast|psycheclone|publisher|python|rambler|raven search|roach|road runner|roadhouse|robbie|robofox|robozilla|rules|salty|sbider|scooter|scoutjet|scrubby|search\.|searchprocess|semanticdiscovery|senrigan|sg\-scout|shai\'hulud|shark|shopwiki|sidewinder|sift|silk|simmany|site searcher|site valet|sitetech\-rover|skymob\.com|sleek|smartwit|sna\-|snappy|snooper|sohu|speedfind|sphere|sphider|spinner|spyder|steeler\/|suke|suntek|supersnooper|surfnomore|sven|sygol|szukacz|tach black widow|tarantula|templeton|\/teoma|t\-h\-u\-n\-d\-e\-r\-s\-t\-o\-n\-e|theophrastus|titan|titin|tkwww|toutatis|t\-rex|tutorgig|twiceler|twisted|ucsd|udmsearch|url check|updated|vagabondo|valkyrie|verticrawl|victoria|vision\-search|volcano|voyager\/|voyager\-hc|w3c_validator|w3m2|w3mir|walker|wallpaper|wanderer|wauuu|wavefire|web core|web hopper|web wombat|webbandit|webcatcher|webcopy|webfoot|weblayers|weblinker|weblog monitor|webmirror|webmonkey|webquest|webreaper|websitepulse|websnarf|webstolperer|webvac|webwalk|webwatch|webwombat|webzinger|wget|whizbang|whowhere|wild ferret|worldlight|wwwc|wwwster|xenu|xget|xift|xirq|yandex|yanga|yeti|yodao|zao\/|zippp|zyborg|\.\.\.\.)/i', $user_agent);
    }

    //example usage
    if (! is_bot($_SERVER["HTTP_USER_AGENT"])) echo "it's a human hit!";
?>

I removed the link to the original source of this code because it now redirects to a food app.

chimeraha
  • 421
  • 5
  • 7
5

Checking the user-agent will alert you to the honest bots, but not the spammers.

To tell which requests are made by dishonest bots, your best bet (based on this guy's interesting study) is to catch a JavaScript focus event.

If the focus event fires, the page was almost certainly loaded by a human being.
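A sketch of how that could feed a self-hosted stats script; the endpoint name and log file are placeholders, and the snippet to embed in the page is shown in the comment:

<?php
// focus-beacon.php - the page embeds something like this (illustrative):
//
//   <script>
//   window.addEventListener('focus', function () {
//       new Image().src = '/focus-beacon.php?u=' + encodeURIComponent(location.pathname);
//   }, { once: true });
//   </script>
//
//   (checking document.hasFocus() on load as well would catch windows that
//   are already focused when the page finishes loading)
//
// Only a real, focused browser window fires the event, so a hit here is a
// strong "human" signal for that IP / session.

session_start();
$_SESSION['is_human'] = true;

$line = sprintf("%s\t%s\t%s\n", date('c'), $_SERVER['REMOTE_ADDR'], $_GET['u'] ?? '-');
file_put_contents(__DIR__ . '/human-focus.log', $line, FILE_APPEND | LOCK_EX);

// Answer with an empty 204 so the image request stays cheap.
http_response_code(204);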

sanderr
  • 94
  • 5
TehShrike
  • 9,855
  • 2
  • 33
  • 28
4

I currently use AWStats and Webalizer to monitor my log files for Apache 2, and so far they have been doing a pretty good job of it. If you like, you can have a look at their source code, as it is an open source project.

You can get the source at http://awstats.sourceforge.net or alternatively look at the FAQ http://awstats.sourceforge.net/docs/awstats_faq.html

Hope that helps, RayQuang

  • 1
    The 1670 line file that awstats uses to look up bots from user agent string is http://awstats.cvs.sourceforge.net/viewvc/awstats/awstats/wwwroot/cgi-bin/lib/robots.pm?view=markup Scary – Day Dec 20 '10 at 18:48
  • I'm with you Ray, AWstats is fine by me – MikeAinOz Dec 23 '10 at 03:24
3

Rather than trying to maintain an impossibly long list of spider user agents, we look for things that suggest human behaviour. The main principle is that we split our session count into two figures: the number of single-page sessions and the number of multi-page sessions. We drop a session cookie and use it to determine multi-page sessions. We also drop a persistent "Machine ID" cookie; a returning user (Machine ID cookie found) is treated as a multi-page session even if they only view one page in that session. You may have other characteristics that imply a "human" visitor - the referrer is Google, for example (although I believe the MS Search bot masquerades as a standard user agent referred with a realistic keyword, to check that the site doesn't show different content to that given to their bot, and that behaviour looks a lot like a human!).

Of course this is not infallible. In particular, if you have lots of people who arrive and "click off", it's not going to be a good statistic for you, nor will it be if you have a predominance of people with cookies turned off (in our case they won't be able to use our [shopping cart] site without session cookies enabled).

Taking the data from one of our clients, we find that the daily single-session count is all over the place - an order of magnitude different from day to day; however, if we subtract 1,000 from the multi-page sessions per day, we get a damn-near-linear rate of 4 multi-page sessions per order placed / two sessions per basket. I have no real idea what the other 1,000 multi-page sessions per day are!
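In PHP, the cookie side of this could look roughly like the following (cookie names, lifetime and log format are placeholders; when counting, keep only the last entry per session id):

<?php
// Call this at the top of every tracked page.
session_start();                       // drops the session cookie

$isReturning = isset($_COOKIE['machine_id']);
if (!$isReturning) {
    // Persistent "Machine ID" cookie, here valid for two years.
    setcookie('machine_id', bin2hex(random_bytes(16)), time() + 2 * 365 * 86400, '/');
}

$_SESSION['pages'] = ($_SESSION['pages'] ?? 0) + 1;

// Classify: a returning machine counts as a multi-page session even on its
// first page view; otherwise we need at least two page views in this session.
$bucket = ($isReturning || $_SESSION['pages'] > 1) ? 'multi_page' : 'single_page';

file_put_contents(__DIR__ . '/sessions.log',
    date('c') . "\t" . session_id() . "\t" . $bucket . "\n",
    FILE_APPEND | LOCK_EX);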

Kristen
  • 4,227
  • 2
  • 29
  • 36
2

Record mouse movement and scrolling using JavaScript. You can tell from the recorded data whether it's a human or a bot - unless the bot is really, really sophisticated and mimics human mouse movements.

neoneye
  • 50,398
  • 25
  • 166
  • 151
2

Now we have all kinds of headless browsers: Chrome, Firefox and others that will execute whatever JS you have on your site. So any JS-based detection won't work.

I think the most confident way would be to track behavior on the site. If I were writing a bot and wanted to bypass the checks, I would mimic scroll, mouse move, hover, browser history, etc. events with headless Chrome alone. To take it to the next level, even if headless Chrome adds some hints about "headless" mode to the request, I could fork the Chrome repo, make changes and build my own binaries that leave no trace.

I think this may be the closest you can get to real human-or-bot detection with no action required from the visitor:

https://developers.google.com/recaptcha/docs/invisible

I'm not sure what techniques are behind this, but I believe Google did a good job of analyzing billions of requests with their ML algorithms to detect whether the behavior is human-ish or bot-ish.

While it is an extra HTTP request, it would not detect a visitor who bounces quickly, so that's something to keep in mind.
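Server-side, verifying the token the invisible reCAPTCHA posts back is a single call to Google's siteverify endpoint; a minimal sketch (the secret key and the "count as human" step are placeholders):

<?php
// Verify the token that the invisible reCAPTCHA widget submitted with the
// page request. $secret is a placeholder for your own reCAPTCHA secret key.
function verify_recaptcha(string $token, string $secret): bool
{
    $response = file_get_contents(
        'https://www.google.com/recaptcha/api/siteverify',
        false,
        stream_context_create([
            'http' => [
                'method'  => 'POST',
                'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
                'content' => http_build_query([
                    'secret'   => $secret,
                    'response' => $token,
                    'remoteip' => $_SERVER['REMOTE_ADDR'],
                ]),
            ],
        ])
    );

    $result = json_decode($response, true);
    return !empty($result['success']);
}

// Count the hit as human only when the token checks out.
if (verify_recaptcha($_POST['g-recaptcha-response'] ?? '', 'YOUR_SECRET_KEY')) {
    // record a human page view
}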

Lukas Liesis
  • 24,652
  • 10
  • 111
  • 109
1

Prerequisite: the referrer is set

Apache level:

LogFormat "%U %{Referer}i %{%Y-%m-%d %H:%M:%S}t" human_log
RewriteRule ^/human/(.*)   /b.gif [L]

# default the flag to 0 for every request (SetEnvIf, so that later SetEnvIf
# directives can test it)
SetEnvIf Request_URI ".*" human_log_session=0

# using referrer: requests coming from our own pages get flagged
SetEnvIf Referer "^http://yoursite.com/" human_log_session=1

# only log hits on the /human/*.gif beacon, and only when the flag is set
SetEnvIf Request_URI "^/human/(.*).gif$" human_dolog=1
SetEnvIf human_log_session 0 !human_dolog
CustomLog logs/human-access_log human_log env=human_dolog

In the web page, embed a /human/$hashkey_of_current_url.gif.
If it's a bot, it's unlikely to have the referrer set (this is a grey area).
If the page is hit directly via the browser address bar, the hit will not be included.

At the end of each day, /human-access_log should contain all the referrers that are actually human page views.

To play it safe, the hash of the referrer from the Apache log should tally with the image name.
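The PHP side of that could look something like this (the salt is a placeholder; the check in the comment is what the end-of-day processing would do per log line):

<?php
// In the page template: emit the beacon image whose name is a keyed hash of
// the current URL, so a log entry can later be checked against its referrer.
$salt = 'some-private-salt';                       // placeholder
$url  = 'http://yoursite.com' . $_SERVER['REQUEST_URI'];
$hash = hash_hmac('md5', $url, $salt);

echo '<img src="/human/' . $hash . '.gif" width="1" height="1" alt="">';

// When post-processing human-access_log, trust a line only if the hash in the
// requested image name matches the hash of the logged referrer. For one log
// line ($requestPath, $referer taken from the log):
//   $ok = hash_equals(
//       hash_hmac('md5', $referer, $salt),
//       basename($requestPath, '.gif')
//   );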

ajreal
  • 46,720
  • 11
  • 89
  • 119
  • this is likely to catch a lot (although I think, not all bots) and taught me something about custom logging. +1 thanks! – Pekka Dec 23 '10 at 11:22
0

Sorry, I misunderstood. You may try another option I have set up on my site: create a non-linked web page with a hard/strange name and log visits to this page separately. Most if not all of the visitors to this page will be bots; that way you'll be able to build your bot list dynamically.
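A sketch of what the honeypot page itself could log (the path and log file are made-up names):

<?php
// honeypot page, e.g. /xk9-do-not-visit.php: never linked in a way a human
// would follow (only via a hidden link or a URL that crawlers harvest), so
// whoever arrives here is almost certainly a bot. Log it and feed the result
// into the bot list.
$entry = sprintf("%s\t%s\t%s\n",
    date('c'),
    $_SERVER['REMOTE_ADDR'],
    $_SERVER['HTTP_USER_AGENT'] ?? '-'
);
file_put_contents(__DIR__ . '/honeypot.log', $entry, FILE_APPEND | LOCK_EX);

http_response_code(404);   // give the bot nothing interesting back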

Original answer follows (getting negative ratings!)

The only reliable way to tell bots from humans is a [CAPTCHA](http://en.wikipedia.org/wiki/Captcha). You can use [reCAPTCHA](http://recaptcha.net/) if it suits you.

Ast Derek
  • 2,739
  • 1
  • 20
  • 28
  • See my clarification in the question above. – Pekka Nov 11 '09 at 18:18
  • =? Sorry, misunderstood. You may try another option I have set up at my site: create a non-linked webpage with a hard/strange name and log apart visits to this page. Most if not all of the visitor to this page will be bots, that way you'll be able to create your bot list dynamically. – Ast Derek Nov 11 '09 at 18:34
  • Nice idea, have not heard of that before! :) – Pekka Nov 11 '09 at 19:34
  • You could call it a honeypot: http://www.slightlyshadyseo.com/index.php/dynamic-crawler-identification-101-trapping-the-bots/ – Frank Farmer Nov 11 '09 at 21:06
  • 1
    I called it HoneyPot www.magentaderek.com/guestbook/ – Ast Derek Nov 12 '09 at 02:32
  • This was my idea as well but by asking around, I found CS/ SE students writing bots that can read CAPTCHA. – kiwicptn Dec 22 '10 at 15:00
0

Have a 1x1 gif in your pages that you keep track of. If it's loaded, then it's likely to be a browser. If it's not loaded, it's likely to be a script.

neoneye
  • 50,398
  • 25
  • 166
  • 151
0

You could exclude all requests that come from a user agent that also requests robots.txt. All well-behaved bots will make such a request, but the bad bots will escape detection.

You'd also have problems with false positives - as a human, it's not very often that I read a robots.txt in my browser, but I certainly can. To avoid these incorrectly showing up as bots, you could whitelist some common browser User Agents, and consider them to always be human. But this would just turn into maintaining a list of User Agents for browsers instead of one for bots.

So, this did-they-request-robots.txt approach certainly won't give 100% watertight results, but it may provide some heuristics to feed into a complete solution.
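As an illustration, an offline pass over an Apache combined-format access log could implement this heuristic like so (the log path and format are assumptions):

<?php
// First pass: collect every user agent that requested /robots.txt, then treat
// hits from those user agents as bot traffic.
$pattern = '/^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+)[^"]*" \d+ \S+ "[^"]*" "([^"]*)"/';

$botAgents = [];
foreach (file('/var/log/apache2/access.log') as $line) {
    if (preg_match($pattern, $line, $m) && $m[3] === '/robots.txt') {
        $botAgents[$m[4]] = true;
    }
}

// Second pass: count only hits whose user agent never asked for robots.txt.
$humanHits = 0;
foreach (file('/var/log/apache2/access.log') as $line) {
    if (preg_match($pattern, $line, $m)
        && $m[3] !== '/robots.txt'
        && !isset($botAgents[$m[4]])) {
        $humanHits++;
    }
}
echo "Approximate human hits: $humanHits\n";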

Day
  • 9,465
  • 6
  • 57
  • 93
-1

I'm surprised no one has recommended implementing a Turing test. Just have a chat box with a human on the other end.

A programmatic solution just won't do: see what happens when PARRY Encounters the DOCTOR.

These two 'characters' are both "chatter" bots that were written in the course of AI research in the '70s to see how long they could fool a real person into thinking they were also a person. The PARRY character was modeled as a paranoid schizophrenic and the DOCTOR as a stereotypical psychotherapist.

Here's some more background

MTS
  • 55
  • 5
  • 1
    from the question (half a year old, BTW): `To clarify: I'm not looking to block bots`. – Your Common Sense Jun 20 '10 at 04:58
  • I was just being jokey. I thought people might enjoy PARRY and the DOCTOR. It's pretty hilarious, especially that it was published as an RFC. – MTS Jun 20 '10 at 05:59