
So far I am able to detect robots from a list of user agent strings by matching them against known user agents, but I was wondering what other methods there are to do this in PHP, as this approach is catching fewer bots than expected.

I am also looking to find out how to detect whether a browser or robot is spoofing another browser through its user agent string.

Any advice is appreciated.

EDIT: This has to be done using a log file with lines as follows:

129.173.129.168 - - [11/Oct/2011:00:00:05 -0300] "GET /cams/uni_ave2.jpg?time=1318302291289 HTTP/1.1" 200 20240 "http://faculty.dentistry.dal.ca/loanertracker/webcam.html" "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.4; en-US; rv:1.9.2.23) Gecko/20110920 Firefox/3.6.23"

This means I can't check user behaviour aside from access times.
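For reference, this is roughly the matching approach I'm using now (the keyword list here is only a small sample):

<?php
// Pull the user agent (the last quoted field) out of each
// combined-log-format line and match it against known bot keywords.
$botKeywords = array('bot', 'crawl', 'spider', 'slurp', 'archiver');

foreach (file('access.log') as $line) {
    if (preg_match('/"([^"]*)"\s*$/', $line, $m)) {
        $userAgent = strtolower($m[1]);
        foreach ($botKeywords as $keyword) {
            if (strpos($userAgent, $keyword) !== false) {
                echo 'Bot: ' . $m[1] . "\n";
                break;
            }
        }
    }
}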

user1422508

  • Unfortunately, no matter how hard you try, bots will still get by whatever you manage to implement for this. – noko Nov 14 '12 at 04:01
  • It's not possible. You can look at it heuristically, but that's it. – Brad Nov 14 '12 at 04:04

5 Answers


In addition to filtering on keywords in the user agent string, I have had luck putting a hidden honeypot link on all pages:

<a style="display:none" href="autocatch.php">A</a>

Then in "autocatch.php" record the session (or IP address) as a bot. This link is invisible to users but it's hidden characteristic would hopefully not be realized by bots. Taking the style attribute out and putting it into a CSS file might help even more.

laifukang
  • This technique works quite well for catching spammers: have an input type="hidden" named "email" and give your real, visible email field a different name. The only downside to hidden links is that they might get flagged as blackhat SEO by Google – WebChemist Nov 14 '12 at 04:17
  • As @WebChemist said, this is dangerous. We live in a world where intelligent and otherwise "correct" solutions are often 'wrong' solutions because: Google. Be very careful with hidden links. – Bangkokian Aug 27 '15 at 08:50

Because, as previously stated, both user agents and IPs can be spoofed, neither can be used for reliable bot detection.

I work for a security company and our bot detection algorithm looks something like this:

  1. Step 1 - Gathering data:

    a. Cross-check user agent vs. IP (both need to be right; a do-it-yourself sketch of this check follows the list)

    b. Check header parameters (what is missing, what order they appear in, etc.)

    c. Check behavior (early access to and compliance with robots.txt, general behavior, number of pages visited, visit rates, etc.)

  2. Step 2 - Classification:

    By cross-verifying the data, the bot is classified as "Good", "Bad" or "Suspicious".

  3. Step 3 - Active Challenges:

    Suspicious bots undergo the following challenges:

    a. JS Challenge (can it activate JS?)

    b. Cookie Challenge (can it accept cookies?)

    c. If still not conclusive -> CAPTCHA
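For step 1a, one cross-check you can do yourself is forward-confirmed reverse DNS, which the major search engines support. This sketch (an illustration of the general technique, not our product's actual check) verifies a claimed Googlebot:

<?php
// Reverse-resolve the IP, check the crawler's documented domain,
// then resolve the hostname forward and confirm it maps back
// to the same IP.
function isRealGooglebot($ip)
{
    // On failure gethostbyaddr() returns false or the IP unchanged,
    // and neither will pass the googlebot.com domain check below.
    $host = gethostbyaddr($ip);
    if ($host === false || !preg_match('/\.googlebot\.com$/i', $host)) {
        return false;
    }
    return gethostbyname($host) === $ip;
}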

This filtering mechanism is VERY effective, but I don't think it could be replicated by a single person or even an unspecialized provider (for one thing, the challenges and the bot DB need to be constantly updated by a security team).

We offer a "do it yourself" tool in the form of Botopedia.org, our directory that can be used for IP/user-name cross-verification, but for a truly efficient solution you will have to rely on specialized services.

There are several free bot monitoring solutions, including our own, and most use the same strategy I've described above (or something similar).

Good luck.

Igal Zeifman

Beyond just comparing user agents, you would keep a log of activity and look for robot behavior. Oftentimes this includes checking for requests to /robots.txt and for clients that never load images. Another trick is to test whether the client runs JavaScript, since most bots won't have it enabled.
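If all you have is an access log, the robots.txt check can still be applied directly: flag every IP that ever requested it. A rough sketch, assuming the combined log format (the log path is a placeholder):

<?php
// Browsers never request /robots.txt on their own, so any IP that
// fetched it is a strong (though not exhaustive) crawler candidate.
$botIps = array();
foreach (file('access.log') as $line) {
    // IP is the first field; the request line is the first quoted field.
    if (preg_match('#^(\S+).*?"[A-Z]+ /robots\.txt[ ?]#', $line, $m)) {
        $botIps[$m[1]] = true;
    }
}
print_r(array_keys($botIps));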

However, beware: you may well accidentally flag some visitors who are genuinely human.

Kyros
  • I should clarify: I have to do this using a log file of user agents, thus I can't check for JavaScript or loaded images, but thanks for the help – user1422508 Nov 14 '12 at 04:10
  • Then you need to post the log, otherwise I have no idea what information you have to work with. – Kyros Nov 14 '12 at 04:11
  • The original post has been edited with an example line from the log file; the actual file consists of over 70,000 lines with a similar structure. – user1422508 Nov 14 '12 at 04:24

No, user agents can be spoofed so they are not to be trusted.

In addition to checking for JavaScript or image/CSS loads, you can also measure page load rate, as bots will usually crawl your site much faster than any human visitor would jump around. But this only works for small sites; on popular sites, many visitors behind a shared external IP address (a large corporation or a university campus) might hit your site at bot-like rates.
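Applied to a log file like the one in the question, this amounts to grouping hits by IP and checking how tightly they are spaced. A rough sketch (both thresholds are arbitrary assumptions):

<?php
// Flag IPs averaging more than one request per second
// over at least 10 logged requests.
$hits = array();
foreach (file('access.log') as $line) {
    if (preg_match('/^(\S+) \S+ \S+ \[([^\]]+)\]/', $line, $m)) {
        $t = DateTime::createFromFormat('d/M/Y:H:i:s O', $m[2]);
        if ($t !== false) {
            $hits[$m[1]][] = $t->getTimestamp();
        }
    }
}
foreach ($hits as $ip => $stamps) {
    $span = max($stamps) - min($stamps);
    if (count($stamps) >= 10 && $span > 0 && count($stamps) / $span > 1.0) {
        echo "$ip: " . count($stamps) . " hits in {$span}s - possible bot\n";
    }
}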

I suppose you could also measure the order in which pages are loaded, as bots would crawl in a first-come-first-crawled order whereas human users usually don't fit that pattern, but that's a bit more complicated to track.

WebChemist
  • No problem, here's a post I helped another user with on making a blocking script to stop excessive bot pageloads that you might be able to adjust to your needs http://webmasters.stackexchange.com/questions/35171/number-of-page-requests-by-any-bot-in-5-secs – WebChemist Nov 14 '12 at 04:24

Your question specifically relates to detection using the user agent string. As many have mentioned this can be spoofed.

To understand what spoofing is possible, and to see how difficult it is to detect, you are probably best advised to learn the art yourself in PHP using cURL.

In essence, using cURL almost everything that can be sent in a browser (client) request can be spoofed, with the notable exception of the IP; but even there, a determined spoofer will hide behind a proxy server to prevent you from detecting their real IP.
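For example, a few lines of cURL produce a request that, in your access log, is indistinguishable from the Firefox entry in the question (the URLs here are placeholders):

<?php
// A scripted request masquerading as Firefox on a Mac; from the
// server's log alone it looks like a real browser visit.
$ch = curl_init('http://example.com/');
curl_setopt($ch, CURLOPT_USERAGENT,
    'Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.4; en-US; rv:1.9.2.23) '
    . 'Gecko/20110920 Firefox/3.6.23');
curl_setopt($ch, CURLOPT_REFERER, 'http://example.com/some-page.html');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$body = curl_exec($ch);
curl_close($ch);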

It goes without saying that using the same parameters on every request would enable a spoofer to be detected, but rotating the parameters makes it very difficult, if not impossible, to detect spoofers amongst genuine traffic logs.

T9b