
Possible Duplicate:
How do you stop scripters from slamming your website hundreds of times a second?

I am building a web application in RubyOnRails, which is based on a large body of data. The application makes for powerful navigation and intersection of the data, as well as a community model for adding more data. In that respect one could compare it with StackOverflow.com: a big bunch of data, structured in a fairly simple way.

I intend to offer the content under a Creative Commons license, but if the site "hits it off", I need to discourage copycats. My biggest fear is screen-scraping scripters, who would not only leech away the raw data but also cause huge usage peaks on my servers.

I wonder if RubyOnRails offers any way to throttle (obviously automated) requests, e.g. to slow down their responses to the benefit of regular users. Perhaps this requires Apache or Phusion Passenger settings?

EDIT: My goal is not to recognize user types, but to reduce responsiveness for overly active users, e.g. by capping the number of requests handled per IP address per unit of time(?)

4 Answers


My suggestion would be to limit any easy iterative navigation of your website, which is the primary way I have seen harvesting programs work. Simply encrypting or obfuscating the ID numbers you use as GET parameters would make strip-mining your info more difficult. You can only try to make getting your information onerous; you won't be able to prevent it completely.
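As a rough illustration of making those IDs non-iterable (a sketch only: it swaps the "encryption" idea for a random token column, and the public_token column, the Item model and the find_by_public_token! lookup are assumptions of mine, not part of the answer):

    # Sketch: expose an opaque random token in URLs instead of the sequential id,
    # so /items/1, /items/2, ... cannot be walked mechanically.
    require 'securerandom'

    class Item < ActiveRecord::Base
      # Assumes a string column named `public_token` on the items table.
      before_create :assign_public_token

      # Rails builds URLs from to_param, so routes become /items/9f2c4e1ab37d...
      def to_param
        public_token
      end

      private

      def assign_public_token
        self.public_token = SecureRandom.hex(8)
      end
    end

    # In the controller, look records up by the token rather than the id:
    #   @item = Item.find_by_public_token!(params[:id])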

Mobius
  • This is not a solution for the throttling problem, but it is indeed a smart trick! Thanks! –  May 22 '09 at 20:46
  • I'm sorry I didn't answer the question you asked! I'm way, way too dumb with RoR to even attempt to offer a programmatic suggestion. – Mobius May 22 '09 at 20:58

You could present a CAPTCHA to the "overly active users", just like SO does when you edit too fast. That should effectively hinder automatic spider-like scraping.
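A minimal sketch of the "overly active user" check (the per-minute threshold, the challenge_path route that would render the CAPTCHA, and the use of Rails.cache as the counter store are all assumptions, not anything specified in the answer):

    class ApplicationController < ActionController::Base
      before_action :divert_overly_active_users   # before_filter on older Rails

      private

      # Count requests per client IP in a one-minute window; clients over the
      # threshold are sent to a CAPTCHA/challenge page instead of the content.
      # The read/write pair is not atomic, which is fine for a rough check, and
      # the challenge controller itself would skip this filter to avoid a loop.
      def divert_overly_active_users
        key   = "request-count:#{request.remote_ip}"
        count = Rails.cache.read(key).to_i + 1
        Rails.cache.write(key, count, expires_in: 1.minute)
        redirect_to challenge_path if count > 120   # ~2 requests/second sustained
      end
    end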

lothar
  • Wouldn't they just set an appropriate delay in how often they gather data? – nevets1219 May 22 '09 at 20:51
  • @nevets1219 well you cannot stop them completely, just slow them down or make their work harder. The OP already acknowledged that. – lothar May 22 '09 at 20:58

You might also want to look into using some Rack middleware to do rate limiting, like the approach a recent article covered for API rate limiting (the sort of thing you'd want at Twitter or similar).
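For a sense of what such middleware looks like, here is a minimal hand-rolled sketch (not the code from the article; the limits, the in-process Hash store and the 503 response are arbitrary choices, and the per-process state only suits a single-server setup):

    # A tiny Rack middleware that throttles requests by client IP.
    class SimpleThrottle
      LIMIT  = 100   # maximum requests ...
      WINDOW = 60    # ... per 60-second window, per IP (arbitrary figures)

      def initialize(app)
        @app  = app
        @hits = Hash.new { |h, k| h[k] = [] }   # ip => [request timestamps]
      end

      def call(env)
        now = Time.now.to_f
        ip  = env['REMOTE_ADDR']
        @hits[ip].reject! { |t| t < now - WINDOW }   # forget hits outside the window
        @hits[ip] << now

        if @hits[ip].size > LIMIT
          [503, { 'Content-Type' => 'text/plain', 'Retry-After' => WINDOW.to_s },
           ['Rate limit exceeded, slow down.']]
        else
          @app.call(env)
        end
      end
    end

    # Wired in without touching application code, e.g.:
    #   config.middleware.use SimpleThrottle

(The rack-attack gem packages this same pattern these days.)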

chrisrbailey
  • This kind of triggers me to follow up with another question: I don't know about Rack. Can I use the rate limiting with Ruby On Rails, possibly with Rack chucked in between? –  May 27 '09 at 21:43
  • Felix, I'm not sure if I fully understand that question, but... at least part of it depends on what your stack is and what your Rails version is. I think if you're on Rails 2.2 (maybe 2.1?) then you're set/compatible for Rack. Then, you'd need to be using a Rack-based stack, and there are a variety of options: Passenger, or Thin, or what not. But really, the point is that Rack is part of your web server stack, and part of the beauty of that rate-limiting implementation is that it works essentially without you having to touch your app - it's all at the Rack middleware layer. – chrisrbailey Jun 02 '09 at 20:38

I believe all you can do is put up hoops for the user to jump through. Ultimately there is no foolproof way to distinguish a regular user from a bot.

Martin Murphy
  • This looks like a good way to do just that: http://www.elxsy.com/2009/06/how-to-identify-and-ban-bots-spiders-crawlers/ – Chad Johnson Feb 11 '12 at 15:17