
What is the best practice to block off bots doing automated searches without annoying users with flood limits?

What is going on:

I have become more aware of odd search behaviour, and I finally had the time to catch who is behind it. It is 157.55.39.*, also known as Bing. Which is odd, because when $_GET['q'] is detected, noindex is added.
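
For reference, the noindex handling is roughly along these lines (a simplified sketch; the actual template code may look different):

<?php
# If a search query is present, ask crawlers not to index the result page:
if (isset($_GET['q']) && $_GET['q'] !== '') {
    echo '<meta name="robots" content="noindex, follow">';
}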

The problem, however, is that they are slowing down the SQL server, because there are simply too many requests coming in.

What I have done so far:

I have implemented a search flood limit. But since I did it with a session cookie, checking and calculating against the last search timestamp, Bing obviously ignores cookies and carries on.

The worst-case scenario is to add reCAPTCHA, but I don't want the "Are you human?" tickbox to show up every time you search. It should appear only when a flood is detected. So the real question is: how do I detect too many requests from a client, so that some sort of reCAPTCHA is triggered to stop the requests?
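
Something like this is what I have in mind (just a rough sketch, assuming the APCu extension is available; render_recaptcha_challenge() is a placeholder for whatever widget ends up being used):

<?php
# Rough sketch: count searches per client IP inside a fixed 60-second window.
$key = 'searchflood_' . $_SERVER['REMOTE_ADDR'];

# apcu_add() only creates the key if it does not exist yet (with a 60 s TTL),
# so the counter resets itself when the window expires.
apcu_add($key, 0, 60);
$count = apcu_inc($key);

if ($count > 30) {
    # Too many searches in one minute: show the CAPTCHA instead of results.
    render_recaptcha_challenge(); # placeholder function
    exit;
}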

EDIT #1:
For now, I have handled the situation with:

<?php

# Get end IP
define('CLIENT_IP', (filter_var(@$_SERVER['HTTP_X_FORWARDED_IP'], FILTER_VALIDATE_IP) ? @$_SERVER['HTTP_X_FORWARDED_IP'] : (filter_var(@$_SERVER['HTTP_X_FORWARDED_FOR'], FILTER_VALIDATE_IP) ? @$_SERVER['HTTP_X_FORWARDED_FOR'] : $_SERVER['REMOTE_ADDR'])));

# Detect Bing:
if (substr(CLIENT_IP, 0, strrpos(CLIENT_IP, '.')) == '157.55.39') {

    # Tell them not right now:
    header('HTTP/1.1 503 Service Temporarily Unavailable');

    # ..and block the request
    die();
}

It works, but it feels like yet another temporary fix for a more systematic problem.

I would like to mention that I still want search engines, including Bing, to index /search.html, just not to actually run searches there. There is no "latest searches" list or anything like that, so it's a mystery where they are getting the queries from.

EDIT #2 -- How I solved it
If someone else in the future has these problems, I hope this helps.

First of all, it turns out that Bing has the same URL parameter feature that Google has, so I was able to tell Bing to ignore the URL parameter "q".

Based on the accepted answer, I added Disallow rules for the parameter q to robots.txt:

Disallow: /*?q=*
Disallow: /*?*q=*

I also configured Bing Webmaster Tools to reduce crawling during our peak-traffic hours.

Overall, this immediately had a positive effect on server resource usage. I will, however, still implement a general flood limit for identical queries, specifically where $_GET is involved, in case Bing should ever decide to hit an AJAX call (example: ?action=upvote&postid=1). A rough sketch of what I have in mind is below.
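
Something along these lines (only a sketch, again assuming APCu; the per-minute limit is an arbitrary number):

<?php
# Sketch: cap how often the exact same GET request may hit the backend,
# regardless of who sends it. The counter lives in a 60-second window.
ksort($_GET);
$key = 'q_' . md5($_SERVER['SCRIPT_NAME'] . '?' . http_build_query($_GET));

apcu_add($key, 0, 60);
if (apcu_inc($key) > 60) {
    # More than 60 identical requests per minute: back off.
    header('HTTP/1.1 429 Too Many Requests');
    exit;
}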

Kalle H. Väravas
  • Also, let's say I detect 157.55.39.* and block it off; what is then the most appropriate response, HTTP 503 or 400? – Kalle H. Väravas Dec 24 '17 at 21:35
  • Don't you have a robots.txt that tells decent scrapers to not go down the search path? – rene Dec 24 '17 at 21:35
  • Did you try storing the flood-limit data somewhere other than cookies? That is a very fragile solution, because crawlers never store/use cookies. A quite easy solution is to use Memcached to store such data, but it heavily depends on the size (load) of your project. P.S. You could also change the rules for the search engines via robots.txt – Abraham Tugalov Dec 24 '17 at 21:38
  • Maybe this is more on-topic on Webmasters.SE: https://webmasters.stackexchange.com/questions/87393/how-to-block-the-most-popular-spider-crawlers-via-robots-txt and https://webmasters.stackexchange.com/questions/23084/ms-bing-web-crawler-out-of-control-causing-our-site-to-go-down and I see some specific blocking and redirecting there as well. – rene Dec 24 '17 at 21:40
  • @AbrahamTugalov I used the cookie as a temporary quick-and-dirty solution. But I am thinking that using memcache/redis for checking floods is the only solution. Is the simplest approach checking the timestamp of the last search query? I want /search.html to be indexed, but not /search.html?q=test -- therefore, when $_GET['q'] is present, it adds NOINDEX, FOLLOW. – Kalle H. Väravas Dec 24 '17 at 21:41
  • @rene Thank you for the links. The thing is, I don't want to block them from searching, I want to control their search-query count. I also don't want to slow down their crawler. The server can more than easily handle it, but right now they are making a lot of pointless queries, which does nothing good for the load. If all else fails, I will block Bing with a 503 whenever $_GET['q'] is detected. – Kalle H. Väravas Dec 24 '17 at 21:44
  • Just create a decent Disallow rule in your robots.txt so the Bing bot doesn't run queries. – rene Dec 24 '17 at 21:48
  • @rene How can I disallow only the URL parameter "q" but keep /search.html allowed? -- Though that doesn't fix the potential for a flood by some other bot in the future, which is my original question: how to protect against floods. – Kalle H. Väravas Dec 24 '17 at 21:50
  • For only a URL parameter I'm not sure. It looks like robots.txt only allows for paths. Sorry. – rene Dec 24 '17 at 21:55

1 Answer


Spam is a problem that all website owners struggle to deal with, and there are many ways to build good protection, ranging from very simple measures to very strong and complex protection mechanisms.

But for you, right now, I see one simple solution: use robots.txt and disallow the Bing spider from crawling your search page.
You can do this very easily.

Your robots.txt file would look like:

User-agent: bingbot
Disallow: /search.html?q=

But this will completely block the search-engine spider from crawling your search results.
If you just want to limit such requests rather than block them entirely, try this:

User-agent: bingbot
Crawl-delay: 10

This tells Bing to crawl your site's pages at most once every 10 seconds.
With such a delay it will crawl at most 8,640 pages a day (86,400 seconds in a day divided by 10), which is a very small number of requests per day.
If that is acceptable to you, you're all set.

But what if you want to control this behaviour manually on the server itself, protecting the search form not only from web crawlers but also from attackers?
They could easily send your server over 50,000 requests per hour.

In this case, I would recommend two solutions.
First, put Cloudflare in front of your website, and don't forget to check whether your server's real IP is still discoverable via services like ViewDNS IP History, because many websites behind Cloudflare overlook this (even popular ones).
If your active server IP is visible in the history, you should consider changing it (highly recommended).

Second, you could use Memcached to store flood data and detect whether a certain IP is querying too much (e.g. more than 30 queries per minute).
If it is, block its ability to search (again via Memcached) for some time.
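
For example, something like this (just a sketch, assuming the php-memcached extension and a Memcached server on localhost; the 30-per-minute limit and the 10-minute block are arbitrary numbers):

<?php
# Sketch only: per-IP flood detection with a temporary block, via Memcached.
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);
# Binary protocol is required for increment() to honour the initial value and expiry.
$mc->setOption(Memcached::OPT_BINARY_PROTOCOL, true);

$ip = $_SERVER['REMOTE_ADDR'];

# If this IP was already flagged as flooding, refuse the search for now.
if ($mc->get('blocked_' . $ip)) {
    header('HTTP/1.1 429 Too Many Requests');
    exit;
}

# Count this IP's queries inside a 60-second window.
$count = $mc->increment('queries_' . $ip, 1, 1, 60);

# More than 30 queries per minute: flag the IP as blocked for 10 minutes.
if ($count !== false && $count > 30) {
    $mc->set('blocked_' . $ip, 1, 600);
}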

Of course, this is not the best possible solution, but it will work and will not cost your server much.

Abraham Tugalov
  • Thank you, overall this solves my problem. I would like to add that Bing appears to have the same "ignore URL parameters" feature as Google, which allows you to set "q" to be ignored. But I did not know about Disallow: /search.html?q=, and that will help me out perfectly. – Kalle H. Väravas Dec 24 '17 at 22:04
  • Glad to help, accept the answer if you have the solution. – Abraham Tugalov Dec 24 '17 at 22:05
  • So I modified it a bit: `Disallow: /*?q=*` and `Disallow: /*?*q=*`. This blocks all possible q parameters. – Kalle H. Väravas Dec 24 '17 at 22:11