
I think I'm having a problem with bots and crawlers inflating my read count (basically a hit counter on a blog post that goes up +1 on each refresh).

Is there any way I can filter out the bots and crawlers? I'm thinking maybe of using $_SERVER['HTTP_USER_AGENT'] to filter with, but I'm not sure how to go about it, or whether it would even work.

Or even if anyone has any better ideas...
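
For reference, a minimal sketch of the user-agent idea being asked about: skip counting when the user agent is empty or contains a common crawler keyword. The keyword list is illustrative, not exhaustive, and bots can lie about their user agent:

    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    $isBot = ($ua === ''); // most real browsers always send a user agent

    // Most honestly-labelled crawlers match one of these substrings.
    foreach (['bot', 'crawl', 'spider', 'slurp'] as $keyword) {
        if (stripos($ua, $keyword) !== false) {
            $isBot = true;
            break;
        }
    }

    if (!$isBot) {
        // increment the read counter here
    }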

FoxyFish
  • It's not very reliable, but yes, you could use $_SERVER['HTTP_USER_AGENT']. –  Jul 04 '18 at 21:19
  • You should store the IP address in a temporary database; then each IP address affects the counter only once in, say, 24 hours. – HTMHell Jul 04 '18 at 21:20
  • Possible duplicate of [how to detect search engine bots with php?](https://stackoverflow.com/questions/677419/how-to-detect-search-engine-bots-with-php) –  Jul 04 '18 at 21:22
  • It wouldn't work, as I have lots of different posts with a read counter, so each would need its own individual IP storage. – FoxyFish Jul 04 '18 at 21:23
  • Of course it would. A great database for this kind of thing is Redis. You can store a key like `view:{post_id}:{ip_address}`, then increase your counter only if that key doesn't exist yet (see the sketch after these comments). – HTMHell Jul 04 '18 at 21:25
  • A: an IP does not equal a person (or bot): one person can have many IPs, and one IP can serve thousands of people. B: this would still count bot hits. –  Jul 04 '18 at 21:29
  • Yeah, it would limit the bots to 1 hit per day, but I want to stop them hitting at all if possible, and I don't want to limit a regular reader. – FoxyFish Jul 04 '18 at 21:30
  • If you decide to use the user agent, know that some bots lie, and new bots show up all the time, so keeping track of user agents won't be trivial. I don't think many people would ever bother to do this. –  Jul 04 '18 at 21:32
  • I was thinking of doing the opposite: instead of a huge ban list of bots, having a whitelist of allowed user agents and only allowing those. But again, I wasn't sure if that would work, or what all the genuine user agents are. I can't think of any alternatives to the user agent for dealing with this, though? – FoxyFish Jul 04 '18 at 21:35
  • New legit user agents appear daily as well. –  Jul 04 '18 at 21:44
  • https://github.com/matomo-org/device-detector is a fantastic device-detection library which includes a `$dd->isBot()` function, if the overhead of adding it to your project is tolerable. The package is part of Matomo/Piwik Analytics, but functions perfectly on its own (a sketch of it follows these comments). – Scuzzy Jul 04 '18 at 22:06
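
For reference, a minimal sketch of HTMHell's Redis suggestion, assuming the phpredis extension is available; `$postId` and `incrementReadCount()` are hypothetical stand-ins for your own post ID and counter code:

    $redis = new Redis();
    $redis->connect('127.0.0.1', 6379);

    // view:{post_id}:{ip_address}, as suggested above.
    $key = "view:{$postId}:{$_SERVER['REMOTE_ADDR']}";

    // NX: only set if the key doesn't exist yet; EX: expire after 24 hours.
    // set() returns true only when the key was actually created, so the
    // counter goes up at most once per IP per post per day.
    if ($redis->set($key, 1, ['nx', 'ex' => 86400])) {
        incrementReadCount($postId);
    }

And a sketch of the device-detector approach from the last comment, using the `DeviceDetector` class and `isBot()` from the package's README (installed via Composer):

    require 'vendor/autoload.php';

    use DeviceDetector\DeviceDetector;

    $dd = new DeviceDetector(isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '');
    $dd->parse();

    if (!$dd->isBot()) {
        incrementReadCount($postId); // hypothetical counter helper
    }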

1 Answer


You could use this trick: check whether the browser claims to support cookies and JavaScript. Most bots don't, even though most bots do fake a valid user agent. Note that `get_browser()` only looks up the claimed user agent in the browscap database, so it reports advertised capabilities rather than verifying them at runtime.

    $browser = get_browser(null, true);

    if (empty($browser['javascript']) || empty($browser['cookies'])) {
        // probably a bot
    }
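
If browscap isn't configured in php.ini, `get_browser()` just returns `false`, so it's worth guarding for that (a real concern on shared hosting). A sketch of gating the counter on it, again with hypothetical `$postId`/`incrementReadCount()`; browscap also has a `crawler` flag for known bots:

    $browser = @get_browser(null, true);

    $looksLikeBot = $browser === false          // browscap not configured
        || !empty($browser['crawler'])          // browscap flags known crawlers
        || empty($browser['javascript'])
        || empty($browser['cookies']);

    if (!$looksLikeBot) {
        incrementReadCount($postId);
    }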

Another way to do it, which might also fail, is to check whether a session has actually been established. Many bots don't accept cookies, so the session cookie never comes back in the request header and the session stays empty on their next request.

    if (empty($_SESSION)) {
        // bot probable
    }

Or you can check for a session variable that you set at the beginning of every session; a fuller sketch follows the snippet below.

    if (!isset($_SESSION['your_var'])) {
        // bot probable
    }
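
A minimal sketch of that session-variable idea: mark the first request, and only count a read once the session cookie has made a round trip. `$postId` and `incrementReadCount()` are hypothetical:

    session_start();

    if (!isset($_SESSION['seen'])) {
        // First request of this session. A real browser will send the
        // session cookie back on its next request; most bots never do.
        $_SESSION['seen'] = true;
    } elseif (empty($_SESSION['counted'][$postId])) {
        // The cookie made a round trip: count each post once per session.
        $_SESSION['counted'][$postId] = true;
        incrementReadCount($postId);
    }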
Eric
  • Thanks. That's a pretty good idea, so long as bots don't start using cookies and JS. – FoxyFish Jul 04 '18 at 21:50
  • Won't work for all bots, but I think the OP knows nothing is going to work 100%. –  Jul 04 '18 at 21:51
  • Yeah, I realise there is zero chance of a definitive bot preventer, but if it stops a good chunk of them that's good enough. It's only a counter at the end of the day, but a slightly truer representation is better than one that's way overinflated. – FoxyFish Jul 04 '18 at 21:53
  • Can't add it, shared hosting. – FoxyFish Jul 04 '18 at 23:14