1

I created a download script that makes users wait five seconds before the download starts automatically, and it also counts downloads. It's very simple. Now I need a way to block bots, because I want the download count to be as realistic as possible: it should count only real users downloading the file, not bots. Is there a list of bots somewhere, or some other way to do this? Thanks.

Gabriele

5 Answers

3

Typical bots can't run JavaScript, so if the wait and the download are driven by JavaScript, they never get as far as triggering the download.

You can add a CAPTCHA if you're worried about bots that can execute JavaScript.

genesis
  • 1
    I don't use javascript, just a PHP script that adds ?action=download after five seconds and starts the download. Any ideas other than captcha? – Gabriele Aug 17 '11 at 06:34
  • @daGrevis: Why? He says he wants fairly reliable statistics, not an extra step for his users. – Eric J. Aug 17 '11 at 06:41
  • 1
    @daGrevis: CAPTCHA has been broken at scale since 2008 http://blogs.itbusiness.ca/2010/08/how-cyber-crooks-break-captchas/ – Eric J. Aug 17 '11 at 06:47
  • @daGrevis That's an excellent point, but [here is a counter-argument](http://www.90percentofeverything.com/2011/03/25/fk-captcha/). – sdleihssirhc Aug 17 '11 at 06:50
  • @daGrevis: reCAPTCHA is also broken http://stackoverflow.com/questions/448963/has-recaptcha-been-cracked-hacked-ocrd-defeated-broken – Eric J. Aug 17 '11 at 06:51
2

Well-behaved robots should respect robots.txt, which allows you to instruct robots how they are allowed to crawl your website.

You cannot reliably block badly behaved robots (short of attempts at human detection such as CAPTCHA, as others have suggested). Even though many robots set a special user agent (you can see examples here), a robot can set the user agent to anything it wants.
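As a rough illustration of what robots.txt does for well-behaved crawlers, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The `/download.php` path and the `ExampleBot` name are assumptions for the example, not details from the question, and the same rules apply whatever language the site itself is written in.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the /download.php path is an assumption.
ROBOTS_TXT = """\
User-agent: *
Disallow: /download.php
"""


def allowed_for_polite_bot(url: str, user_agent: str = "ExampleBot") -> bool:
    """Return True if a crawler that honors robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(allowed_for_polite_bot("/download.php?action=download"))
    print(allowed_for_polite_bot("/index.php"))
```

This only models the polite case: a crawler that ignores robots.txt never runs this check at all.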

Eric J.
1

Use a CAPTCHA. I suggest reCAPTCHA.

Jose Adrian
  • I "can't", I prefer the download to start as fast as possible. Any ideas other than captcha? – Gabriele Aug 17 '11 at 06:36
  • I don't know how bots work, but maybe if the user leaves the page, the download should stop. There is a PHP script that does that. I don't remember where you can get it, but give it a try. – Jose Adrian Aug 17 '11 at 06:43
1

There are various methods that you could use to get rid of a bot, but they'll also filter out some real users:

  • Only allow clients that send an acceptable User-Agent string.
  • Only allow clients that have JavaScript enabled.
  • Only allow clients that have cookies enabled.
  • Only allow clients that uncheck a checkbox that says, "I'm a robot."
  • Only allow clients that don't fill out a honeypot text input.
  • Have a CAPTCHA (this is used by webmasters who hate their users and have no respect for them; only suggested for sadists and jerks)

You can pick and choose, or combine them to create your own flavor of robot discrimination.
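One way to combine a few of these checks is sketched below in Python; the logic should carry over to a PHP script more or less directly. Everything named here is hypothetical: the `Request` stand-in, the `session` cookie, the `website` honeypot field, and the list of User-Agent hints are assumptions for illustration, and each check can be fooled by a bot that spoofs its headers or handles cookies.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Illustrative substrings only; real bot user agents vary and can be spoofed.
BOT_UA_HINTS = ("bot", "crawler", "spider", "curl", "wget")


@dataclass
class Request:
    """Minimal stand-in for an HTTP request (hypothetical, not a framework type)."""
    user_agent: str = ""
    cookies: Dict[str, str] = field(default_factory=dict)
    form: Dict[str, str] = field(default_factory=dict)


def bot_signals(req: Request) -> List[str]:
    """Return the names of the heuristics this request trips."""
    signals = []
    ua = req.user_agent.lower()
    if not ua or any(hint in ua for hint in BOT_UA_HINTS):
        signals.append("user-agent")
    # Assumes the waiting page set a "session" cookie before the download.
    if "session" not in req.cookies:
        signals.append("no-cookie")
    # Assumes a hidden "website" input that real users never see or fill in.
    if req.form.get("website", "") != "":
        signals.append("honeypot")
    return signals


def should_count_download(req: Request, max_signals: int = 0) -> bool:
    """Count the download only if at most max_signals heuristics fired."""
    return len(bot_signals(req)) <= max_signals
```

Raising `max_signals` trades accuracy for fewer false positives, for example so that a real user who blocks cookies is still counted.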

sdleihssirhc
  • It's almost impossible to define an "acceptable" User-Agent string (I recently parsed 32 million log entries and found 2 million unique user agent strings, most of which are non-bots); even if you could, a bot can send any user agent string it wants to; bots can execute JavaScript; bots can accept cookies; bots can permute post parameters to simulate various form input; CAPTCHA (as you note) isn't really user friendly. – Eric J. Aug 17 '11 at 06:43
  • @Eric I agree with everything you say. While that's a pretty freaking sophisticated bot that could do all of those things (supposedly, if it can manipulate dynamically-generated HTML, send/set cookies, parse forms to figure out which fields to manipulate and which to ignore, etc., then it will also be able to crack most weak CAPTCHAs), it's still totally possible. And in my opinion, treating a real user as if they were a bot just because they had a black-listed User-Agent or disabled JavaScript is unacceptable. So I personally use a handful of these methods and accept the false positives. – sdleihssirhc Aug 17 '11 at 06:49
  • So what do you suggest, Eric? – Gabriele Aug 17 '11 at 06:49
  • 1
    @Gabriele: Bot detection is a bit like an arms race... how much effort you put into it depends on how bad the consequences are of allowing a bot through now and then. If you're interested in getting mostly-accurate stats, just adding robots.txt as suggested in my answer is probably a good solution. Catching badly behaved bots (especially without making life harder for your users) takes a lot of effort to catch relatively few bots. – Eric J. Aug 17 '11 at 06:56
0

Honeypot fields and timestamp analysis.
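A minimal sketch of how the two ideas might fit the five-second wait from the question, written in Python for illustration: the waiting page issues a signed timestamp, and the download is counted only if the token is genuine, at least five seconds old, and the honeypot field came back empty. `SECRET_KEY`, `MAX_AGE_SECONDS`, and the function names are assumptions, not part of the original script.

```python
import hashlib
import hmac
import time

SECRET_KEY = b"change-me"   # hypothetical server-side secret
MIN_WAIT_SECONDS = 5        # the five-second wait from the question
MAX_AGE_SECONDS = 3600      # assumed upper bound on token age


def _sign(payload: str) -> str:
    return hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()


def issue_token(now=None) -> str:
    """Create a token when the waiting page is rendered."""
    issued = int(time.time() if now is None else now)
    return f"{issued}.{_sign(str(issued))}"


def is_plausible_human(token: str, honeypot_value: str, now=None) -> bool:
    """Check the token when the download request arrives."""
    now = time.time() if now is None else now
    issued_str, _, sig = token.partition(".")
    # Reject forged or tampered tokens before trusting the timestamp.
    if not hmac.compare_digest(sig.encode(), _sign(issued_str).encode()):
        return False
    # A filled-in honeypot field suggests an automated form filler.
    if honeypot_value:
        return False
    elapsed = now - int(issued_str)
    return MIN_WAIT_SECONDS <= elapsed <= MAX_AGE_SECONDS
```

A request that arrives sooner than the page would have allowed, or with a token the server never issued, is simply not counted; the download itself can still be served.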

daGrevis