5

I am getting hit numerous times by crawlers on a page that triggers an API call. I would like to limit access to that page for bots that do not respect my robots.txt.

Note: This question is not a duplicate. I want rate limiting, not IP blacklisting.

Abram
  • Maybe use the `rack-attack` gem to whitelist bots that you want to allow? – Hesham Jan 13 '16 at 19:47
  • Possible duplicate of [Rack-Attack: Array of IP addresses](http://stackoverflow.com/questions/23915107/rack-attack-array-of-ip-addresses) – Mike Szyndel Jan 13 '16 at 20:39
  • I think that @kimrgrey's response is more relevant, seeing that I want to limit based on IP, not block IPs entirely. – Abram Jan 14 '16 at 02:08

3 Answers

8

Check out the gem: Rack::Attack!

Battle-tested in production environments.
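For example, in a Rails app you could add the gem and put a throttle rule in an initializer. This is only a minimal sketch: the "/geocode" path, the rule name, and the 20-requests-per-hour limit are placeholders to adapt, and depending on your rack-attack version you may also need to insert the middleware yourself with config.middleware.use Rack::Attack.

# config/initializers/rack_attack.rb
# Sketch: throttle a hypothetical /geocode path to 20 requests per hour
# per client IP. Path, rule name, and limits are placeholder values.
Rack::Attack.throttle("geocode/ip", limit: 20, period: 1.hour) do |req|
  req.ip if req.path.start_with?("/geocode")
end

Requests over the limit are rejected by the middleware before they reach your controller, which keeps the expensive API call from firing.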

Tilo
  • Hey just wondering, how would this be used in the case where bot IP addresses are dynamic? – Abram Jan 16 '16 at 19:50
  • it would still count by IP, and the bot has only a limited number of IPs at its disposal... so most likely all of them will be flagged and blocked. But don't be too aggressive when setting your limits, because of requests from NAT'ed networks. – Tilo Jan 16 '16 at 19:52
  • My current approach is to block them based on their headers ... Seems to be working well so far! – Abram Jan 16 '16 at 19:55
  • you can also use other identifiers than just the IP with Rack::Attack -- the video covers this afaik. – Tilo Jan 16 '16 at 19:57
  • sure thing. You can also throttle requests based on the path. It can also do blacklists (see the sketch just after these comments). Very useful gem. – Tilo Jan 16 '16 at 20:06
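To illustrate the header-based idea from the comments above, here is a hedged sketch of blocking by User-Agent. The pattern is made up, and older rack-attack versions name this method blacklist rather than blocklist:

# config/initializers/rack_attack.rb
# Hypothetical example: reject requests whose User-Agent matches patterns
# you have seen misbehaving in your logs (the pattern below is invented).
Rack::Attack.blocklist("block misbehaving crawlers") do |req|
  req.user_agent.to_s =~ /BadBot|EvilCrawler/i
end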
3

If you are using Redis in your project, you can very simply implement a request counter for API requests. This approach allows you not just to limit robot access, but to limit different API requests using different policies based on your preferences. Take a look at this gem, or follow this guide if you want to implement the limit yourself.
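As a rough illustration of that counter idea with the plain redis gem (the key prefix, limit, and window below are arbitrary examples, not taken from the gem or guide mentioned above):

require "redis"

REDIS = Redis.new

# Fixed-window counter: allow at most `limit` requests per IP per `window`
# seconds. Returns true when the caller should be rejected.
def rate_limited?(ip, limit: 20, window: 3600)
  key = "api:geocode:#{ip}"
  count = REDIS.incr(key)
  # Set the expiry only when the key is first created, so later requests
  # do not extend the window.
  REDIS.expire(key, window) if count == 1
  count > limit
end

A controller could then call rate_limited?(request.remote_ip) in a before_action and redirect or render a 429 when it returns true.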

kimrgrey
1

So, for anyone interested, I found an alternative solution that works without adding Rack::Attack or Redis. It's a little hacky, but hey, it might help someone else.

# Count requests per IP in the Rails cache and block once the count passes 20.
count = 0
cached_count = Rails.cache.read("user_ip_#{get_ip}_count")
unless cached_count.nil?
  count = cached_count + 1
  if count > 20
    flash[:error] = "You're doing that too much. Slow down."
    redirect_to root_path and return false
  end
end
Rails.cache.write("user_ip_#{get_ip}_count", count, expires_in: 60.minutes)

This limits requests to the geocoder to 20 per hour. The get_ip helper returns a fixed address outside production, for testing purposes:

def get_ip
  if Rails.env.production?
    @ip = request.remote_ip
  else
    # Hard-code your own IP when testing outside production
    @ip = "{YOUR_IP}"
  end
end

Update

I thought this was a great idea, but it turns out it doesn't work due to changing IP addresses of crawlers. I have instead implemented this rather simple code:

# request.bot? is provided by the voight_kampff gem
if request.bot?
  Rails.logger.info "Bot Request Denied from #{get_ip}"
  flash[:error] = "Bot detected."
  redirect_to root_path and return false
end

Using this handy Rails gem: voight_kampff

Abram
  • I think your algorithm is incorrect: if you get a request every 10 minutes, you'll get the error message after 3h20 (20 * 10 minutes), since your cache expiration date is reset every time you get a request. – romainsalles Mar 23 '17 at 10:11