
Sorry if this has been asked before, but I've been researching for hours now with no real definitive answer.

I am working on a site that has had some serious security flaws in the past. These have been fixed (and I am constantly checking to make sure there aren't others), but the site is getting hammered by bots. I've implemented some checks in PHP that use a third party to ban known spam IP addresses, and I have blocked referrers such as semalt in .htaccess, which has helped massively, but it's not enough.
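
For anyone curious, the referrer blocking is nothing exotic; a minimal sketch of that kind of rule would look something like this (the pattern and the spam_ref variable name are just illustrative, and it assumes the same Apache 2.2-style Order/Allow/Deny syntax as the rules further down):

#flag requests whose Referer header mentions semalt, then refuse them
SetEnvIfNoCase Referer "semalt\.com" spam_ref
Order Allow,Deny
Allow from all
Deny from env=spam_ref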

Because the problem is so bad and it takes me hours per day to manually block IPs, hostnames, etc., I want to take a more aggressive approach. Rather than blocking specific offenders, I'd rather just let through what I want using .htaccess:

- Good bots like Google, MSN, Yahoo, etc.
- Anyone with a hostname.

I realise this will still let some bad bots through, but the majority of traffic comes from bots without a hostname, so it will be a good start.
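
The closest I have come to expressing the "anyone with a hostname" part is the sketch below, though I am not sure it is the right way to do it. It assumes Apache 2.2, that HostnameLookups is On in the server or vhost config (it cannot be enabled from .htaccess and it costs a reverse DNS lookup per request), and that Remote_Host falls back to the bare IP address when reverse DNS fails; good_pass is just the same variable my rules further down already allow on.

#crude "has a hostname" test: if reverse DNS fails, Remote_Host is the
#numeric IPv4 address, which contains no letters (an IPv6 literal would
#contain hex letters, so this is IPv4-only)
SetEnvIf Remote_Host "[A-Za-z]" good_pass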

I have two questions:

1) Is there a better way to approach this?

2) If not, how do I achieve this?

This is what I have so far (I have a bigger list of browsers), but it does not seem to work:

#allow just search engines we like, we're OPT-IN only

#a catch-all for Google
BrowserMatchNoCase Google good_pass
BrowserMatchNoCase Slurp good_pass
#note: Slurp's UA starts with "Mozilla/5.0 (compatible; Yahoo! Slurp ...)",
#so an anchored ^Yahoo never matches; match anywhere in the string instead
BrowserMatchNoCase Yahoo good_pass
BrowserMatchNoCase msnbot good_pass
BrowserMatchNoCase SandCrawler good_pass
BrowserMatchNoCase Teoma good_pass
BrowserMatchNoCase Jeeves good_pass

#allow Firefox, MSIE, Opera etc., will punt Lynx, cell phones and PDAs, don't care
BrowserMatchNoCase Chrome good_pass
BrowserMatchNoCase Mozilla good_pass

#Let just the good guys in, punt everyone else to the curb
#which includes blank user agents as well
#note: Order takes a single argument, so "Deny, Allow" with a space after
#the comma is a syntax error and makes Apache answer every request with a 500
Order Deny,Allow
Deny from all
Allow from env=good_pass
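
I have also been wondering whether, since user-agent strings can be faked, the big crawlers would be better identified by their reverse DNS names. As far as I can tell, Apache does a double (forward-confirmed) reverse DNS lookup for name-based Allow rules regardless of HostnameLookups, and the domains below are the ones those crawlers are generally documented to resolve to, but please double-check them before relying on this:

#only trust the big crawlers if their IP actually resolves back into the
#operator's domain, not just because the User-Agent says so
Allow from googlebot.com
Allow from search.msn.com
Allow from crawl.yahoo.net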
  • There is no real answer to this, since there is no secure way to determine the type of client making a request. All strategies rely on data provided by the client, which can obviously be faked without limitation - and which typically _is_ faked by "bad bots". – arkascha Oct 31 '15 at 17:18
  • Hi, thanks for your comment. I do appreciate that, but I just need to get this under control for the moment. As far as I am aware, anything without a hostname is a bot? So as long as I let through what appears to be the bots that I want, I can get rid of a lot of the problem by blocking the rest (i.e. no hostname and not Google, Yahoo, MSN, etc). – thebronsonite Oct 31 '15 at 17:31
  • Not sure what you mean by "without a hostname", but sure, go ahead, block whatever requests you want to. You will still serve requests from bots disguising, but as said: you cannot really prevent that. – arkascha Oct 31 '15 at 17:32
  • Fingers crossed I can get it to a manageable level, then look at what I can do about sneaky bots. By "without a hostname", I mean hits I get from IPs like this one: 112.111.185.92. Looking up that IP on ipinfo.io or similar will show no hostname (some lookups default to the IP). Compare that to a Baidu spider (bad example!) and the hostname shows as "baiduspider-123-125-71-110.crawl.baidu.com" – thebronsonite Oct 31 '15 at 17:46
  • Ah, you are talking about using a DNS lookup to legitimize a request. Not exactly what it is meant for, but as said: sure, if you feel like blocking such requests, go ahead. However, one question: _why_ do you want to block requests anyway? I mean, you make information available to public access on purpose (that is what an HTTP server is for). So what is the point in then denying access to it again? Or could it be that you lack some form of authorization strategy, since the information is _not_ meant for the public? That would be a completely different issue. – arkascha Oct 31 '15 at 17:49
  • I want to block them for a few reasons: 1) Resources - The site is hitting bandwidth limits and affecting other sites on the server. 2) Stats - It's almost impossible to monitor how well the site is doing. 3) User experience - The amount of hits is surely taking a toll on site loading times for legitimate customers. There must be some sort of solution to help with a problem like this, isn't there? – thebronsonite Oct 31 '15 at 17:55
  • Sorry, no, as said in the beginning there is _not_ some "solution" to this, since you cannot really decide which requests are "legitimate" and which are not. Such bots do nothing "illegal", and your blocking attempt actually falsifies the numbers collected by your monitoring, because you end up blocking arbitrary but legitimate requests. You made some vague remarks about what might indicate something, but that is nothing usable for a "real" solution. If some harmless bot requests are really slowing down normal usage, then this sounds more like a general issue with performance. – arkascha Oct 31 '15 at 18:02
  • As I said, I just want to get this down to a manageable level. Bots that make up to 14,000 requests per IP per day (from thousands of different IPs) are not something that I'm going to watch and not attempt to do something about, regardless of whether it is legal or not. – thebronsonite Oct 31 '15 at 18:08
  • Have a look at [this answer](http://stackoverflow.com/a/131685/5397119). – Sergio Ivanuzzo Nov 01 '15 at 00:43

0 Answers