I'm trying to build a script that shows me a list of IPs that are bots/spiders.

I wrote a script that imports the Apache access log into a MySQL database so I can work with it using PHP and MySQL.

I've noticed that a lot of bots request pages at regular intervals: they send a request every 2 or 3 seconds. Is there an easy way to show these patterns with a query or a PHP script? Or, even harder I think, is there an algorithm that can recognise these bots/spiders?

DB:

CREATE TABLE IF NOT EXISTS `access_log` (
  `IP` varchar(16) NOT NULL,
  `datetime` datetime NOT NULL,
  `method` varchar(255) NOT NULL,
  `status` varchar(255) NOT NULL,
  `referrer` varchar(255) NOT NULL,
  `agent` varchar(255) NOT NULL,
  `site` smallint(6) NOT NULL
);
PvdL
  • See [tell bots apart from human visitors for stats?](http://stackoverflow.com/questions/1717049/tell-bots-apart-from-human-visitors-for-stats); it might already answer your question – Pekka Feb 24 '11 at 09:57

1 Answer

Official bots will identify themselves in their user-agent string. There's a list at http://www.robotstxt.org/db.html
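As a rough illustration of the self-identification check, here is a minimal Python sketch that matches the `agent` column against a few crawler tokens. The token list and the function name `is_declared_bot` are assumptions made for the example, not an authoritative list; it would need to be extended from the robots database. A user agent can also be spoofed, so a match only tells you what the client claims to be.

```python
# Hypothetical, incomplete list of substrings that well-known crawlers
# include in their User-Agent header. Extend it from the robots database.
KNOWN_BOT_TOKENS = (
    "googlebot",
    "bingbot",
    "msnbot",
    "slurp",
    "baiduspider",
    "yandexbot",
)


def is_declared_bot(agent: str) -> bool:
    """Return True if the user-agent string contains a known crawler token."""
    agent = agent.lower()
    return any(token in agent for token in KNOWN_BOT_TOKENS)
```

The same substring test could be written as a `LIKE` condition on the `agent` column if you prefer to keep it in the database.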

For the unofficial ones I guess you could try looking for some of the following:

  • Page requests with no other resource requests (images, CSS, JavaScript, etc.)
  • Strange URL requests (lots of requests for login pages, especially ones that don't exist, such as wp-admin on a Drupal site)
  • Successive page views in a short amount of time
  • Exactly the same URL signatures coming from many different IPs
  • No HTTP referrer for IPs that you've never seen before
  • Lots of comment posts in a short session
  • Requests from public proxy servers

Those are some of the things I've noticed about the annoying ba***s that keep trying to scrape and spam my site, anyway. Some of these signals would probably need to be combined to avoid flagging real visitors who happen to share the same characteristics.
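For the regular-interval pattern mentioned in the question, one possible approach is to group requests by IP, measure the gaps between consecutive requests, and flag IPs whose gaps barely vary. The Python sketch below assumes the rows come from something like `SELECT IP, datetime FROM access_log`. The function name `regular_interval_ips` and the thresholds `min_requests` and `max_cv` are made up for the example and would need tuning against real traffic.

```python
from collections import defaultdict
from statistics import mean, pstdev


def regular_interval_ips(rows, min_requests=5, max_cv=0.2):
    """Return {ip: (average_gap_seconds, coefficient_of_variation)}.

    rows is an iterable of (ip, datetime) pairs. An IP is reported when it
    made at least min_requests requests and the gaps between them are
    nearly constant (standard deviation / mean <= max_cv).
    Both thresholds are illustrative assumptions, not recommended values.
    """
    times_by_ip = defaultdict(list)
    for ip, when in rows:
        times_by_ip[ip].append(when)

    suspects = {}
    for ip, times in times_by_ip.items():
        if len(times) < min_requests:
            continue  # too few requests to judge regularity
        times.sort()
        gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
        average = mean(gaps)
        if average == 0:
            continue  # every request in the same second: no interval to measure
        variation = pstdev(gaps) / average
        if variation <= max_cv:
            suspects[ip] = (average, variation)
    return suspects
```

A low variation on its own is only one signal. As noted above, it would probably need to be combined with others (no asset requests, missing referrer) before treating an IP as a bot.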

Ewan Heming
  • Well, the hardest part is, yet again, telling who's the good guy and who's the bad guy. MSN bots (not Bing) use a lot of IPs and are the good guys, but act like bad ones... – PvdL Feb 28 '11 at 10:23