8

How can I prevent my ASP.NET 3.5 website from being screen-scraped by my competitor? Ideally, I want to ensure that no web bots or screen scrapers can extract data from my website.

Is there a way to detect that a web bot or screen scraper is running?

Rubén
user279521

8 Answers

12

It is possible to try to detect screen scrapers:

Use cookies and timing; this will make it harder for out-of-the-box screen scrapers. Also check for JavaScript support, since most scrapers do not execute it, and inspect the browser metadata in the request headers to verify the client really is a web browser.
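The header checks described above can be sketched as follows (a minimal illustration in Python; the signature list and header heuristics are assumptions, not an exhaustive bot test):

```python
# Hypothetical header heuristics: real browsers send certain headers,
# while naive scrapers often omit them or use telltale User-Agent strings.
BOT_SIGNATURES = ("curl", "wget", "python-requests", "scrapy", "httpclient")

def looks_like_bot(headers):
    """Return True when the request headers resemble a naive scraper."""
    ua = headers.get("User-Agent", "").lower()
    if not ua or any(sig in ua for sig in BOT_SIGNATURES):
        return True
    # Browsers normally send Accept-Language; bare HTTP libraries
    # frequently do not.
    return "Accept-Language" not in headers
```

Note this only catches scrapers that don't bother to fake browser headers; a determined competitor will simply copy a real browser's User-Agent string.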

You can also count requests per minute: a user driving a browser can only make a small number of requests per minute, so server logic that detects too many requests per minute can presume that screen scraping is taking place and block the offending IP address for some period of time. If this starts to affect legitimate crawlers, log the IPs that get blocked and whitelist them as needed.
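The per-minute counting logic above can be sketched like this (shown in Python as a language-agnostic illustration; the request limit and block duration are arbitrary assumptions):

```python
import time
from collections import defaultdict, deque

class RateLimiter:
    """Tracks request timestamps per IP and blocks clients that exceed
    a per-minute limit. Limit and block duration are illustrative."""

    def __init__(self, max_per_minute=60, block_seconds=600):
        self.max_per_minute = max_per_minute
        self.block_seconds = block_seconds
        self.hits = defaultdict(deque)   # ip -> timestamps of recent requests
        self.blocked = {}                # ip -> time when the block expires

    def allow(self, ip, now=None):
        now = time.time() if now is None else now
        # Still serving a block for this IP?
        if ip in self.blocked:
            if now < self.blocked[ip]:
                return False
            del self.blocked[ip]
        q = self.hits[ip]
        q.append(now)
        # Drop timestamps older than the 60-second window.
        while q and now - q[0] > 60:
            q.popleft()
        if len(q) > self.max_per_minute:
            self.blocked[ip] = now + self.block_seconds
            return False
        return True
```

In ASP.NET the same idea would live in an `HttpModule` or `Application_BeginRequest`, keyed on `Request.UserHostAddress`.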

You can also use http://www.copyscape.com/ to protect your content; this will at least tell you who is reusing your data.

See this question also:

Protection from screen scraping

Also take a look at

http://blockscraping.com/

Nice doc about screen scraping:

http://www.realtor.org/wps/wcm/connect/5f81390048be35a9b1bbff0c8bc1f2ed/scraping_sum_jun_04.pdf?MOD=AJPERES&CACHEID=5f81390048be35a9b1bbff0c8bc1f2ed

How to prevent screen scraping:

http://mvark.blogspot.com/2007/02/how-to-prevent-screen-scraping.html

James Campbell
  • +1 good answer. but... I have beaten most of those guards, thus my answer. ;-) – Sky Sanders Apr 24 '10 at 17:49
  • His question is whether it is possible to detect. It is, and it is easy to make it a pain to write a program to scrape the site; it is not 100%, but it makes scraping harder. If a user can bring it up in a browser, it can be scripted, unless you use a CAPTCHA to gate the info you don't want scraped. – James Campbell Apr 24 '10 at 17:51
  • Yes, you are right. I am guilty of answering a different question. – Sky Sanders Apr 24 '10 at 20:23
9

Unplug the network cable to the server.

Paraphrase: if the public can see it, it can be scraped.

Update: upon second look, it appears that I am not answering the question. Sorry. Vecdid has offered a good answer.

But any half-decent coder could defeat the measures listed. In that context, my answer could be considered valid.

Sky Sanders
2

I don't think it is possible without authenticating users to your site.

Raj Kaimal
2

You could use a CAPTCHA.

Also, you can mitigate it instead by throttling their connection. It won't completely prevent them from screen scraping but it will probably prevent them from getting enough data to be useful.

First, for cookied users, throttle connections so each client sees at most one page view per second; once the one-second timer is up, there is no throttling whatsoever. No impact on normal users, lots of impact on screen scrapers (at least if you have a lot of pages they're targeting).

Next, require cookies to see the data-sensitive pages.

They'll be able to get in, but as long as you don't accept bogus cookies, they won't be able to screen scrape much with any real speed.
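The cookie-plus-throttle scheme described above can be sketched as follows (a minimal illustration; the in-memory session store and one-second interval are assumptions for the example):

```python
import time

class SessionThrottle:
    """Allows at most one page view per interval per session cookie.
    Requests without a known cookie are refused outright."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self.last_seen = {}  # session id -> timestamp of last allowed view

    def issue_cookie(self, session_id, now=None):
        # Record the session when the cookie is first handed out, backdated
        # so the first page view is not throttled.
        now = time.time() if now is None else now
        self.last_seen.setdefault(session_id, now - self.min_interval)

    def allow(self, session_id, now=None):
        now = time.time() if now is None else now
        if session_id not in self.last_seen:
            return False  # bogus or missing cookie: no data pages
        if now - self.last_seen[session_id] < self.min_interval:
            return False  # faster than one view per interval: throttled
        self.last_seen[session_id] = now
        return True
```

Rejecting session IDs the server never issued is what stops the "bogus cookies" trick mentioned above.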

Chris Moschini
1

Ultimately you can't stop this.

You can make it harder for people to do, by setting up a robots.txt file etc. But you've got to get the information onto legitimate users' screens, so it has to be served somehow, and if it is, your competitors can get to it.
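For completeness, a minimal robots.txt that disallows everything except a known search engine might look like this (only well-behaved crawlers honor it; a competitor's scraper will simply ignore it):

```
# robots.txt - honored only by well-behaved crawlers
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
```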

If you force users to log in you can stop most robots, but there's nothing to stop a competitor from registering for your site anyway. This may also drive potential customers away if they can't access some information for "free".

ChrisF
1

If your competitor is in the same country as you, post an acceptable use policy and terms of service clearly on your site. Mention the fact that you do not allow any sort of robots or screen scraping. If the scraping continues, get an attorney to send them a friendly cease-and-desist letter.

Widor
Strong Like Bull
0

I don't think that's possible. But whatever you come up with will be as bad for search engine optimization as it is for the competition. Is that really desirable?

JulianR
0

How about serving up every bit of text as an image? Once that is done, either your competitors will be forced to invest in OCR technology, or you will find that you have no users, so the question will be moot.

Peter M