14

Time goes by, but still no perfect solution... Does anyone have a bright idea for telling a bot apart from a human-loaded web page? Is the state of the art still to load a long list of well-known SE bots and parse the User-Agent?

Testing has to be done before the page is loaded! No GIFs or CAPTCHAs!

alex
Riccardo
  • possible duplicate of [Tell bots apart from human visitors for stats?](http://stackoverflow.com/questions/1717049/tell-bots-apart-from-human-visitors-for-stats) – Pekka Nov 05 '10 at 14:17
  • Pekka, yes it is very similar! However there was no accepted solution over there... who knows maybe someone can enlighten us this time? :-) – Riccardo Nov 05 '10 at 14:34
  • The answers here are assuming the bot may be trying to spoof the User Agent. What about if the bot is willing to state it is a bot? It seems like the robots.txt specification is being standardized and RPA is becoming more common so I am assuming this will be an issue soon. – Damien Golding Dec 26 '19 at 07:06

9 Answers

5

If possible, I would try a honeypot approach for this one. It will be invisible to most users and will discourage many bots, though not the determined ones, since they could implement special code for your site that simply skips the honeypot field once they figure out your game. But that would take far more attention from the bot's owners than it is probably worth for most sites; there will be plenty of other sites accepting their spam without any extra effort on their part.

One thing that gets skipped over from time to time is that it is important to let the bot think everything went fine: no error messages or denial pages, just reload the page as you would for any other user, except skip adding the bot's content to the site. That way there are no red flags that can be picked up in the bot's logs and acted upon by its owner, and it will take much more scrutiny for them to figure out you are discarding their comments.
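
A minimal sketch of that idea, assuming a plain HTML form and a small server-side check; the field name `website_url` and the handler names below are made up for illustration:

```python
# A hidden "honeypot" field is added to the form, e.g.:
#   <input type="text" name="website_url" style="display:none" autocomplete="off">
# Humans never see it, so it arrives empty; naive bots fill in every field.

def save_comment(form_data: dict) -> None:
    # Placeholder for whatever normally persists the content.
    pass

def looks_like_bot(form_data: dict) -> bool:
    """True if the hidden honeypot field was filled in."""
    return bool(form_data.get("website_url", "").strip())

def handle_submission(form_data: dict) -> str:
    if looks_like_bot(form_data):
        # Pretend everything went fine so nothing stands out in the bot's logs,
        # but silently discard the content.
        return "Thanks for your comment!"
    save_comment(form_data)
    return "Thanks for your comment!"
```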

Matthew Vines
  • Honeypots are a pretty good approach for dealing with automated "casual" web spiders, but of course they can't help with any kind of targeted bot activity. – Gareth Nov 05 '10 at 14:20
  • Yeah I don't think this is where the fight ends against malicious bots, but this is a good first step, and it may keep your site spam free for quite some time, until you get really popular, and bots begin to target you specifically, then you have to step up your game a bit. – Matthew Vines Nov 05 '10 at 14:23
3

Without a challenge (like a CAPTCHA), you're just shooting in the dark. The User-Agent can trivially be set to any arbitrary string.
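
To illustrate how weak that signal is, here is a small sketch (using Python's `requests` library; the UA string and URL are just examples) of a client claiming to be an ordinary browser:

```python
import requests

# Any client can claim to be any browser it likes.
fake_ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
           "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0 Safari/537.36")
resp = requests.get("https://example.com/", headers={"User-Agent": fake_ua})
print(resp.status_code)
```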

Alex Howansky
2

What the others have said is true to an extent: if a bot-maker wants you to think a bot is a genuine user, there's no way to prevent that. But many of the popular search engines do identify themselves. There's a list here (http://www.jafsoft.com/searchengines/webbots.html), among other places. You could load these into a database and match against them there. I seem to remember that it's against Google's user agreement to serve custom pages to its bots, though.
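
A rough sketch of that lookup, assuming the known signatures have been exported to a plain text file (one lowercase substring per line, e.g. googlebot, bingbot) rather than a full database table:

```python
def load_bot_signatures(path: str = "bot_signatures.txt") -> list[str]:
    """One lowercase substring per line, e.g. 'googlebot', 'bingbot', 'slurp'."""
    with open(path, encoding="utf-8") as fh:
        return [line.strip().lower() for line in fh if line.strip()]

def is_known_bot(user_agent: str, signatures: list[str]) -> bool:
    """Case-insensitive substring match against the request's User-Agent."""
    ua = (user_agent or "").lower()
    return any(sig in ua for sig in signatures)

signatures = load_bot_signatures()
ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
print(is_known_bot(ua, signatures))  # True if 'googlebot' is in the list
```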

Nathan MacInnes
1

The user agent is set by the client and thus can be manipulated. A malicious bot certainly would not send you an I-Am-MalBot user agent, but would call itself some version of IE. Using the User-Agent to prevent spam or anything similar is therefore pointless.

So, what do you want to do? What's your final goal? If we knew that, we could be of more help.

NikiC
  • I need to collect USER stats, filtering out non-human user agents, wish to do this by myself, no tools such as Google Analytics, please! – Riccardo Nov 05 '10 at 14:28
  • Although this is a useful article in itself, it does not qualify as an _answer_; Would have been better suited as a comment, in my opinion. – Core Xii Nov 05 '10 at 14:54
  • @Core: No, this is not a comment. I answer, that he can't solve the problem this way. At least there is no reason at all to downvote. – NikiC Nov 05 '10 at 15:27
  • He asked for the _best_ way to do it, not if it could/should be done the way he figured. So your answer adds little to the table. – Core Xii Nov 05 '10 at 16:06
  • First please tell me whether you have downvoted or not, so I know whether it makes sense to participate in this absolutely off-topic discussion. PS: He does actually ask whether you should use the user-agent for identifying a bot (at least he asks whether it is the state of art.) – NikiC Nov 05 '10 at 16:38
1

The creators of SO should know why they use a CAPTCHA to prevent bots from editing content. The reason is that there is actually no way to be sure a client is not a bot, and I think there never will be.

Thariama
1

I write web crawlers myself, for different purposes, and I use a web browser User-Agent.

As far as I know, you cannot distinguish bots from humans if the bot is using a legitimate User-Agent, like:

Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.11 (KHTML, like Gecko) Chrome/9.0.570.1 Safari/534.11

The only thing I can think of is JavaScript. Most custom web bots (like those I code) can't execute JavaScript, because that's the browser's job. But if the bot is linked to or driving a real web browser (like Firefox), it will go undetected.
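
One hedged way to act on that observation: have the page fire a small JavaScript beacon after it loads and only count visits whose beacon arrives; simple scrapers that never run JS never end up in the "human" bucket. The `/beacon` endpoint, the request-id scheme and the in-memory set below are all illustrative assumptions:

```python
# The served page would include something along these lines:
#   <script>fetch('/beacon?rid=REQUEST_ID');</script>
# Only JS-capable clients ever hit the /beacon endpoint.

js_capable: set[str] = set()   # request ids whose beacon arrived

def record_beacon(request_id: str) -> None:
    """Called by the /beacon endpoint handler."""
    js_capable.add(request_id)

def count_as_human(request_id: str) -> bool:
    """Used later when aggregating stats: keep only JS-capable page views."""
    return request_id in js_capable
```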

Ruel
  • Custom web bots certainly *could* execute javascript, even if they aren't running inside a graphical browser – Gareth Nov 05 '10 at 14:18
  • @Gareth I agree, edited my answer. I'm actually referring to simple scrapers which don't need javascript. I think the majority of those don't. – Ruel Nov 05 '10 at 14:21
0

Honest bots, such as search engines, will typically access your robots.txt. From that you can learn their useragent string and add it to your bot list.

Clearly this doesn't help with malicious bots which are pretending to be human, but for some applications this could be good enough if all you want to do is filter search engine bots out of your logs (for example).
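
A small sketch of that approach: scan an access log for fetches of robots.txt and collect the User-Agent strings that made them. The log path and the combined-log-format pattern are assumptions about your setup and may need adjusting:

```python
import re

# Apache/Nginx "combined" log format assumed; adjust the pattern to your logs.
LINE_RE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def bot_user_agents(log_path: str = "access.log") -> set[str]:
    agents: set[str] = set()
    with open(log_path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            m = LINE_RE.search(line)
            if m and m.group("path").startswith("/robots.txt"):
                agents.add(m.group("ua"))
    return agents

if __name__ == "__main__":
    for ua in sorted(bot_user_agents()):
        print(ua)
```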

Peter Bagnall
0

I'm sure I'm going to take a downvote on this, but I had to post it.

In any case, captchas are the best way right now to protect against bots, short of approving all user-submitted content.

-- Edit --

I just noticed your P.S., and I'm not sure of any way to diagnose a bot without interacting with it. Your best bet in this case might be to catch bots as early as possible and implement a one-month IP restriction, after which the bot should give up if you constantly return HTTP 404 to it. Bots are often run from a server and don't change their IP, so this should work as a mediocre approach.
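
A hedged sketch of such a temporary ban, kept in memory for simplicity (a real deployment would want to persist it); anything on the list gets a 404 for roughly a month:

```python
import time

BAN_SECONDS = 30 * 24 * 3600           # roughly one month
banned_ips: dict[str, float] = {}       # ip -> time the ban expires

def ban(ip: str) -> None:
    banned_ips[ip] = time.time() + BAN_SECONDS

def should_serve_404(ip: str) -> bool:
    expires = banned_ips.get(ip)
    if expires is None:
        return False
    if time.time() >= expires:
        del banned_ips[ip]              # ban elapsed, let the IP back in
        return False
    return True
```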

Craige
  • (IP) Proof of concept: there is no public proxy-chain implementation for PHP/Perl/Python. (There may be a single proxy, but usually bots are not that paranoid.) – kagali-san Nov 05 '10 at 15:43
0

I would suggest using Akismet, a spam prevention plugin, rather than any sort of CAPTCHA or CSS trick, because it is excellent at catching spam without ruining the user experience.
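
For reference, a sketch of calling Akismet's comment-check endpoint from Python; the endpoint and field names are as I recall them from Akismet's API documentation, so verify against the current docs before relying on this:

```python
import requests

API_KEY = "your-akismet-key"            # placeholder
BLOG_URL = "https://example.com/"       # the site the key is registered for

def is_spam(user_ip: str, user_agent: str, content: str) -> bool:
    # Endpoint and parameter names as documented by Akismet; double-check them.
    resp = requests.post(
        f"https://{API_KEY}.rest.akismet.com/1.1/comment-check",
        data={
            "blog": BLOG_URL,
            "user_ip": user_ip,
            "user_agent": user_agent,
            "comment_type": "comment",
            "comment_content": content,
        },
        timeout=5,
    )
    return resp.text.strip() == "true"  # "true" means Akismet thinks it is spam
```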

Lotus Notes