What are ways that websites can block web scrapers? How can you identify if your server is being accessed by a bot?
-
Is that a programming question? – MaxVT Aug 05 '10 at 07:16
-
I figured this would affect the way you program websites – cnaut Aug 05 '10 at 08:12
-
It may help to add whether this is an existing website or a new development, and which technology you are using (RoR, .NET), or whether that decision has not yet been made and you're just looking for high-level ideas (which may even help guide the choice of technology). – Paul Hadfield Aug 05 '10 at 08:58
-
This is preliminary, so I am just trying to get some high-level ideas about both basic and complex ways to block web scrapers. – cnaut Aug 05 '10 at 15:30
-
HTTP access is HTTP access. What's the difference between writing a program that downloads your webpage and telling Firefox to do the same? There is no intrinsic difference. – Karl Aug 23 '10 at 00:47
-
Walk up to your webserver and pull out the ethernet cable. That works most of the time. – Jeffrey Greenham Feb 08 '11 at 22:35
6 Answers
- Captchas
- Form submitted in less than a second
- Hidden (by css) field gets a value submitted during form submit
- Frequent page visits
Simple bots cannot scrape text from Flash, images, or sound.
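The form-timing and hidden-field ideas above can be sketched together. This is a minimal illustration, not the answerer's actual code; the `website` field name and the in-memory store are hypothetical, and a real site would persist the timestamps in the session.

```python
import time

# Hypothetical in-memory store of when each form was served (illustration only).
form_served_at = {}

def serve_form(session_id):
    """Record when the form was rendered. The rendered HTML would also
    include a CSS-hidden 'website' field that humans never see or fill."""
    form_served_at[session_id] = time.monotonic()

def looks_like_bot(session_id, submitted_fields):
    # Honeypot: if the hidden field came back with a value, a bot filled it.
    if submitted_fields.get("website"):
        return True
    # Timing: humans rarely complete a form in under a second.
    served = form_served_at.get(session_id)
    if served is not None and time.monotonic() - served < 1.0:
        return True
    return False
```

A human submission would arrive seconds later with the honeypot field empty, so both checks pass.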

-
All of those options (whilst valid) could also block legitimate crawlers such as Google, badly affecting your page rank, and captchas would get in the way of normal users. This doesn't answer the question of how you could identify your site being accessed by a bot, either. – Paul Hadfield Aug 05 '10 at 07:45
Unfortunately your question is similar to asking how to block spam: there's no fixed answer, and it won't stop a persistent person or bot.
However, here are some methods that can be implemented:
- Check the User-Agent header (though this can be spoofed).
- Use robots.txt (well-behaved bots will, hopefully, respect it).
- Detect IP addresses that access a lot of pages too consistently (e.g. every x seconds).
- Review your logs manually, or build flags into your system, to see who is visiting your site, and block the routes the scrapers take.
- Don't use a standard template on your site, use generic CSS class names, and don't put HTML comments in your code.
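The too-consistent-access idea in the list above amounts to a sliding-window rate check per IP. A rough sketch follows; the window size and threshold are made-up numbers you would tune for your own traffic, and a real deployment would use shared storage rather than a process-local dict.

```python
import time
from collections import defaultdict, deque

# Hypothetical thresholds -- tune these against real traffic.
WINDOW_SECONDS = 10
MAX_REQUESTS = 20

hits = defaultdict(deque)  # ip -> timestamps of recent requests

def record_request(ip, now=None):
    """Record one request and return True if this IP is requesting
    pages faster than a human plausibly would."""
    now = time.monotonic() if now is None else now
    window = hits[ip]
    window.append(now)
    # Drop timestamps that have fallen out of the window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    return len(window) > MAX_REQUESTS
```

Calling `record_request` on every page load flags an IP that makes, say, 30 requests in 3 seconds, while a browsing human stays under the threshold.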

-
It's easy for a scraper to break the template into sections even if you do change the layout of your code a little bit. – Duniyadnd Oct 27 '10 at 17:32
You can use robots.txt to block bots that take notice of it (while still letting through known crawlers such as Google), but that won't stop those that ignore it. You may be able to get the user agent from your web server logs, or you could update your code to record it somewhere. If you then wanted, you could block particular user agents from accessing your website by returning an empty/default page and/or a particular status code.
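Both halves of this answer can be sketched briefly. The robots.txt rules below are a standard example of allowing one crawler while excluding others from a path, and the user-agent denylist is entirely hypothetical; as the answer notes, the header is trivially spoofed, so this only deters lazy bots.

```python
# A robots.txt like this asks well-behaved crawlers to stay out of
# /private/ while leaving Googlebot unrestricted. Ignoring it has no
# technical consequence -- it is purely advisory.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /private/
"""

# Hypothetical denylist of User-Agent substrings; a real list needs upkeep.
BLOCKED_AGENTS = ("badbot", "scrapetool")

def should_block(user_agent):
    """Return True if the request's User-Agent matches the denylist,
    in which case the server would send an empty page or an error code."""
    ua = (user_agent or "").lower()
    return any(marker in ua for marker in BLOCKED_AGENTS)
```

On a match, the server might respond with 403 rather than the real content.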

I don't think there is a way of doing exactly what you need, because crawlers/scrapers can set any request header, including User-Agent, so you won't be able to tell whether a request came from a user running Mozilla Firefox or from a scraper/crawler...

Scrapers rely to some extent on the consistency of markup from page load to page load. If you want to make life difficult for them, come up with a means of serving altered markup from request to request.
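One way to serve altered markup per request, sketched under assumptions (the function names are mine, and a real implementation would have to emit a matching stylesheet generated with the same suffixes):

```python
import secrets

def randomized_class(base):
    """Give a CSS class name a fresh per-request suffix so a scraper
    can't hard-code selectors from one page load to the next."""
    return f"{base}-{secrets.token_hex(4)}"

def render_item(title):
    # Each request produces a different class attribute for the same element.
    cls = randomized_class("item")
    return f'<div class="{cls}">{title}</div>'
```

Two successive page loads yield different selectors for structurally identical markup, which breaks scrapers keyed on fixed class names (at the cost of also breaking caching and any of your own scripts that rely on those names).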

Something like "Bad Behavior" might help: http://www.bad-behavior.ioerror.us/
From their site:
Bad Behavior is designed to integrate into your PHP-based Web site, running as early as possible to throw out spam bots before they have the opportunity to vandalize your site with their junk, or even to scrape your pages for e-mail addresses and forms to fill out.
Not only does Bad Behavior block actual vandalism to your site, it also blocks many e-mail address harvesters, resulting in less e-mail spam, and many automated Web site cracking tools, helping to improve your Web site’s security.

-
Dunno why this was downvoted. Bad Behavior does indeed block a wide variety of web scrapers. I should know, I wrote it. – Michael Hampton May 12 '13 at 02:17