4

I have a LAMP server where I run a website that I want to protect against bulk scraping / downloading. I know there is no perfect solution for this and that an attacker will always find a way. But I would like to have at least some "protection" that makes stealing the data harder than having nothing at all.

The website has approximately 5000 subpages with valuable text data and a couple of pictures on each page. I would like to analyze incoming HTTP requests online and, if there is suspicious activity (e.g. tens of requests in one minute from one IP), automatically blacklist that IP address from further access to the site.

I fully realize that what I am asking for has many flaws, but I am not looking for a bullet-proof solution, just a way to limit script kiddies from "playing" with easily scraped data.

Thank you for your on-topic answers and possible solution ideas.

Frodik
    I'm just waiting for the related question `How do I download a protected website automatically?` ;) – phihag Aug 01 '11 at 09:30
  • perhaps any solution is so flawed as to be simply pointless, and you just have to accept what happens to data put on the Internet. –  Aug 01 '11 at 09:32
  • I stated that it's obvious to me that it can be always achieved. However, I just want to make life harder to those novice-level kids who will get stuck when trying and will not have motivation to try it in other ways... – Frodik Aug 01 '11 at 09:35

7 Answers

2

Although this is a pretty old post, I think the answer isn't quite complete, and I thought it worthwhile to add my two cents. First, I agree with @symcbean: try to avoid using IPs and instead use a session, a cookie, or another method to track individuals. Otherwise you risk lumping together groups of users sharing an IP. The most common method for rate limiting, which is essentially what you are describing ("tens of requests in one minute from one IP"), is the leaky bucket algorithm.
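
As a rough illustration of the leaky bucket idea (not part of the original answer), here is a minimal PHP sketch keyed on the session; the capacity, leak rate and 429 response are arbitrary example choices:

    <?php
    // Minimal leaky-bucket rate limiter keyed on the PHP session.
    // Capacity and leak rate are arbitrary example values.
    session_start();

    $capacity     = 30;    // bucket holds at most 30 requests
    $leak_per_sec = 0.5;   // bucket drains half a request per second

    $now   = microtime(true);
    $level = $_SESSION['bucket_level'] ?? 0.0;
    $last  = $_SESSION['bucket_time']  ?? $now;

    // Drain whatever has leaked since the last request, then add this one.
    $level = max(0.0, $level - ($now - $last) * $leak_per_sec) + 1.0;

    $_SESSION['bucket_level'] = $level;
    $_SESSION['bucket_time']  = $now;

    if ($level > $capacity) {
        http_response_code(429);   // Too Many Requests
        exit('Slow down.');
    }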

Other ways to combat web scrapers are:

  • Captchas
  • Make your code hard to interpret, and change it up frequently. This makes scripts harder to maintain.
  • Download IP lists of known spammers, proxy servers, TOR exit nodes, etc. This is going to be a lengthy list, but it's a great place to start; you may also want to block all Amazon EC2 IPs. (A rough blocklist check is sketched after this list.)
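
As a sketch of the IP-list idea above (the file path and one-IP-per-line format are my own assumptions, not part of the original answer):

    <?php
    // Deny requests from IPs listed one per line in a downloaded blocklist file
    // (individual proxies, TOR exit nodes, spammers; EC2 ranges would need CIDR handling).
    $blocked = file('/etc/myapp/blocked_ips.txt',
                    FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

    if ($blocked !== false && in_array($_SERVER['REMOTE_ADDR'], $blocked, true)) {
        http_response_code(403);
        exit('Forbidden');
    }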

This list, and rate limiting, will stop simple script kiddies, but anyone with even moderate scripting experience will easily be able to get around it. Combating scrapers on your own is a futile effort, but my opinion is biased because I am a co-founder of Distil Networks, which offers anti-scraping protection as a service.

Rami
1

Referrer checking is one very simple technique that works well against automated attacks. You serve content normally if the referrer is your own domain (i.e. the user has reached the page by clicking a link on your own site), but if the referrer is not set, you can serve alternate content (such as a 404 Not Found).
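
A minimal PHP version of this check, assuming example.com stands in for your own domain and a plain 404 as the alternate content:

    <?php
    // Serve a 404 when the referrer is missing or points to another domain.
    // Remember to whitelist search-engine crawlers if you apply this to pages
    // you want indexed (see the caveats below).
    $referer = $_SERVER['HTTP_REFERER'] ?? '';
    $host    = parse_url($referer, PHP_URL_HOST);

    if ($host !== 'example.com' && $host !== 'www.example.com') {
        http_response_code(404);
        exit('Not Found');
    }
    // ...otherwise render the real page as usual.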

Of course, you need to set this up to allow search engines to read your content (assuming you want that). Also be aware that if you have any Flash content, the referrer is never set, so you can't use this method for it.

Also it means that any deep links into your site won't work - but maybe you want that anyway?

You could also enable it just for images, which makes it a bit harder for them to be scraped from the site.

Roger
  • Wow, drive by downvotes already? I think the OP asked a reasonable question and was clearly aware of the limitations. I've used this technique to add a bit of difficulty for script kiddies in the past and it does work, within the limitations of anything on the internet being largely fair game. – Roger Aug 01 '11 at 09:55
1

Sorry - but I'm not aware of any off-the-shelf anti-leeching code that does a good job.

How do you limit access without placing burdens on legitimate users or providing a mechanism for DoSing your site? As with spam prevention, the best solution is to use several approaches and maintain a badness score.

You've already mentioned looking at the rate of requests, but bear in mind that users will increasingly be connecting from NATed networks and IPv6 PoPs. A better approach is to check per session: you don't need to require your users to register and log in (although OpenID makes this a lot simpler), but you could redirect them to a defined starting point whenever they make a request without a current session and log them in with no username/password. Checking the referrer (and that the referrer really does point to the current content item) is a good idea too, as is tracking 404 rates. Add road blocks: when the score exceeds a threshold, redirect to a captcha or require a login. Checking the user agent can be indicative of attacks, but it should be used as part of the scoring mechanism, not as a yes/no criterion for blocking.
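
A rough sketch of such a scoring mechanism in PHP (the signals, weights, threshold and /captcha.php target are invented for illustration):

    <?php
    // Accumulate a per-session "badness" score from several weak signals and
    // road-block the visitor once it crosses a threshold.
    session_start();
    $score = $_SESSION['badness'] ?? 0;

    if (empty($_SERVER['HTTP_REFERER'])) {
        $score += 1;    // request arrived with no referrer
    }
    if (preg_match('/wget|curl|libwww|python/i', $_SERVER['HTTP_USER_AGENT'] ?? '')) {
        $score += 5;    // user agent looks like a script, but only adds to the score
    }

    $_SESSION['badness'] = $score;

    if ($score > 20) {
        header('Location: /captcha.php');   // road block: demand a captcha or login
        exit;
    }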

Another approach, rather than interrupting the flow, is to start substituting content when the thresholds are triggered. You can do the same when you see repeated external hosts appearing in your referrer headers.

Do not tarpit connections unless you've got a lot of resources server-side!

symcbean
0

If you don't mind using an API, you can try ours: https://ip-api.io

It aggregates several databases of known IP addresses of proxies, TOR nodes and spammers.

Andrey E
0

Something I've employed on some of my websites is to block the known User-Agents of downloaders or archivers. You can find a list of them here: http://www.user-agents.org/ (unfortunately, it's not easy to sort by Type: D). In the host's setup, I enumerate the ones I don't want with something like this:

SetEnvIf User-Agent ^Wget/[0-9\.]* downloader

Then I can do a `Deny from env=downloader` in the appropriate place. Of course, changing user agents isn't difficult, but at least it's a bit of a deterrent, if going through my logs is any indication.

If you want to filter by requests per minute or something along those lines, I don't think there's a way to do that in Apache itself. I had a similar problem with ssh and saslauth, so I wrote a script to monitor the log files; if a certain number of failed login attempts were made within a certain amount of time, it appended an iptables rule that blocked that IP from accessing those ports.
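
A sketch in the same spirit for the HTTP case (this is my own illustration, not the author's script): read access-log lines from stdin, e.g. piped from logtail by a per-minute cron job, count requests per IP, and block heavy hitters with iptables.

    <?php
    // Usage (as root): logtail /var/log/apache2/access.log | php block_heavy_ips.php
    // The threshold and the iptables call are illustrative assumptions.
    $threshold = 60;   // requests per run (i.e. per minute) before an IP is blocked
    $counts    = [];

    while (($line = fgets(STDIN)) !== false) {
        $ip = strtok($line, ' ');           // Common Log Format starts with the IP
        if ($ip !== false && $ip !== '') {
            $counts[$ip] = ($counts[$ip] ?? 0) + 1;
        }
    }

    foreach ($counts as $ip => $n) {
        if ($n > $threshold) {
            exec('iptables -A INPUT -s ' . escapeshellarg($ip) . ' -j DROP');
        }
    }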

Jon Lin
-1

I would advise one of two things.

The first would be: if you have information that other people want, give it to them in a controlled way, say, through an API.

The second would be to copy Google: if you scrape Google's results a lot (and I mean a few hundred times a second), it will notice and force you through a captcha.

I'd say that if a site is visited 10 times a second, it's probably a bot, so give it a captcha to be sure.

If a bot crawls your website slower than 10 times a second, I see no reason to try to stop it.

Johan
-2

You could use a counter (in the DB or in the session) and redirect the visitor once the limit is exceeded.

    <?php
    // Rough per-session counter (could equally be keyed on the client IP in a DB):
    // redirect once the request limit is exceeded.
    session_start();
    $limit = 60;                                 // max requests per session
    $_SESSION['count'] = ($_SESSION['count'] ?? 0) + 1;
    if ($_SESSION['count'] > $limit) {
        header('Location: /rate-limited.html');  // or serve a captcha instead
        exit;
    }

I think dynamically blocking IPs with an IP blocker would work even better.

Hmmm..