
The data on our website can easily be scraped. How can we detect whether a human is viewing the site or a tool?

One way might be to measure how long a user stays on a page, but I do not know how to implement that. Can anyone help me detect and prevent automated tools from scraping data from my website?

I use a security image (CAPTCHA) in the login section, but even then a human could log in and then hand the session over to an automated tool. And when a reCAPTCHA image appears after a period of time, the user could simply type the answer and then let the automated tool continue scraping.

I have developed a tool myself to scrape another site, so I just want to prevent the same thing from happening to mine!

derloopkat
banupriya

6 Answers


DON'T do it.

It's the web; you will not be able to stop someone from scraping your data if they really want it. I've done it many, many times before and gotten around every restriction that was put in place. In fact, having a restriction in place just motivates me further to try to get the data.

The more you restrict your system, the worse you make the experience for legitimate users. It's just a bad idea.

NullUserException

It's the web. You need to assume that anything you put out there can be read by human or machine. Even if you can prevent it today, someone will figure out how to bypass it tomorrow. Captchas have been broken for some time now, and sooner or later, so will the alternatives.

However, here are some ideas for the time being.

And here are a few more.

And for my favorite: one clever site I've run across asks a question like "On our 'About Us' page, what is the street name of our support office?" It takes a human to find the About Us page (the link doesn't literally say "About Us"; it says something similar that a person would figure out), and then to find the support office address (different from the main corporate office and several others listed on the page) you have to read through several candidates. Current computer technology can't figure that out any more than it can handle true speech recognition or cognition.
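Checking an answer like that takes very little code. Here is a minimal sketch, assuming a Node/Express app written in TypeScript; the route names, the question text, and the expected answer are all made up for illustration, not taken from any real site.

```typescript
import express from "express";

const app = express();
app.use(express.urlencoded({ extended: false })); // parse the challenge form post

// Hypothetical site-specific challenge. The point is that the answer lives on
// your own "About Us" page, not in a dictionary a bot could guess from.
const CHALLENGE = {
  question: 'On our "About Us" page, what is the street name of our support office?',
  answer: "elm street",
};

app.get("/challenge", (_req, res) => {
  res.send(`
    <form method="post" action="/challenge">
      <p>${CHALLENGE.question}</p>
      <input name="answer" autocomplete="off">
      <button type="submit">Continue</button>
    </form>`);
});

app.post("/challenge", (req, res) => {
  // Normalize before comparing so "Elm Street" and " elm street " both pass.
  const given = String(req.body.answer ?? "").trim().toLowerCase();
  if (given === CHALLENGE.answer) {
    // A real app would set a session flag here and redirect back to the page
    // the visitor originally asked for.
    res.send("Thanks, carry on.");
  } else {
    res.status(403).send("That does not match what our About Us page says.");
  }
});

app.listen(3000);
```

The hard part is not the code but writing questions that humans find easy and scripts find pointless to automate.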

A Google search for "CAPTCHA alternatives" turns up quite a bit.

David
  • Yes, but you can't leave your website without any security features just because every security feature can be broken one day. – Ankit Jaiswal Aug 19 '10 at 05:26
  • Agreed, but rule #1 of security is to assume your site is vulnerable and implement defense in depth. And I have to wonder: how often does it actually matter whether it's a human reading the site or not? That should be ONE of the concerns, but I have yet to come across a situation where it would be a deal-breaker. Secure the site with everything at your disposal, and the human vs. bot issue becomes less of a factor. – David Aug 19 '10 at 05:27
  • Yes, I think the aim here should be to make scraping difficult rather than trying to determine whether the user is a machine or a human being. – Ankit Jaiswal Aug 19 '10 at 05:38

I should note that where there's a will, there's a way.

That being said, I've thought about what you're asking, and here are some simple things I came up with:

  1. Simple, naive checks such as user-agent filtering and blocking. You can find a list of common crawler user agents here: http://www.useragentstring.com/pages/Crawlerlist/ (a rough sketch of such a check follows this list).

  2. You can always display your data in Flash, though I do not recommend it.

  3. Use a CAPTCHA.
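To make item 1 concrete, here is a minimal sketch of user-agent filtering as middleware, assuming an Express + TypeScript setup (the framework choice and the signature list are my assumptions, not something from the question):

```typescript
import express, { Request, Response, NextFunction } from "express";

const app = express();

// Hypothetical list of substrings that show up in common crawler user agents;
// a fuller list could be built from the crawler database linked above.
const CRAWLER_SIGNATURES = ["bot", "spider", "crawler", "curl", "wget", "python-requests"];

app.use((req: Request, res: Response, next: NextFunction) => {
  const ua = (req.get("User-Agent") ?? "").toLowerCase();
  if (ua === "" || CRAWLER_SIGNATURES.some((sig) => ua.includes(sig))) {
    res.status(403).send("Automated access is not allowed.");
    return;
  }
  next();
});

app.get("/", (_req, res) => {
  res.send("Normal page content");
});

app.listen(3000);
```

Keep in mind this only turns away tools that are honest about their user agent; a scraper that sends a normal browser string sails straight through.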

Other than that, I'm not really sure if there's anything else you can do but I would be interested in seeing the answers as well.

EDIT:

Google does something interesting: if you're searching for SSNs, after the 50th page or so they will show a CAPTCHA. That raises the question of whether you can intelligently time how long a user spends on your pages or, if you introduce pagination into the equation, how long a user spends on a single page.

Building on that, you could require a minimum delay before another HTTP request is accepted. At that point it might also be worthwhile to generate a CAPTCHA "randomly": one HTTP request goes through fine, but the next one requires a CAPTCHA. You can mix those up as you please.
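As a rough sketch of that idea (Express + TypeScript and all of the numbers are my assumptions; nothing here is from the question):

```typescript
import express, { Request, Response, NextFunction } from "express";

const app = express();

// Per-client bookkeeping kept in memory purely for illustration; a real site
// would tie this to a session and persist it in a shared store.
interface ClientState { count: number; lastRequest: number; }
const clients = new Map<string, ClientState>();

const CAPTCHA_THRESHOLD = 50;        // mirrors the "after the 50th page" idea
const MIN_INTERVAL_MS = 500;         // requests arriving faster than this look automated
const RANDOM_CHALLENGE_RATE = 0.02;  // occasionally challenge "randomly"

app.use((req: Request, res: Response, next: NextFunction) => {
  if (req.path === "/captcha") return next(); // never challenge the challenge page

  const key = req.ip ?? "unknown";
  const now = Date.now();
  const state = clients.get(key) ?? { count: 0, lastRequest: 0 };

  const tooFast = state.lastRequest !== 0 && now - state.lastRequest < MIN_INTERVAL_MS;
  state.count += 1;
  state.lastRequest = now;
  clients.set(key, state);

  if (state.count > CAPTCHA_THRESHOLD || tooFast || Math.random() < RANDOM_CHALLENGE_RATE) {
    res.redirect(`/captcha?return=${encodeURIComponent(req.originalUrl)}`);
    return;
  }
  next();
});

app.get("/captcha", (_req, res) => {
  // Placeholder: render whatever challenge you use; on success you would
  // reset the client's counter and send them back to the "return" URL.
  res.send("Prove you are human here.");
});

app.listen(3000);
```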

Mahmoud Abdelkader

This can't be done without risking false positives (and annoying users).

How can we detect whether a human is viewing the site or a tool?

You can't. How would you handle tools that parse the page for a human, like screen readers and accessibility tools?

For example, one way is by measuring the time a user stays on a page, from which we can detect whether human intervention is involved. I do not know how to implement that; I am just thinking about this method. Can anyone help me detect and prevent automated tools from scraping data from my website?

You won't detect automated tools, only unusual behavior. And before you can define unusual behavior, you need to know what's usual. People view pages in different orders, browser tabs let them do things in parallel, and so on.
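Just to show how fuzzy "unusual" gets in practice, here is a naive heuristic sketched in TypeScript (entirely my own assumption of how one might try it): flag clients whose recent requests are both very fast and very evenly spaced, which looks more like a loop than a person. It will still misfire on real people, which is exactly the problem described above.

```typescript
// Keep the last few request timestamps per client (IP, session id, etc.).
const history = new Map<string, number[]>();

function looksAutomated(clientKey: string, now: number = Date.now()): boolean {
  const times = history.get(clientKey) ?? [];
  times.push(now);
  history.set(clientKey, times.slice(-6)); // only the most recent handful matters

  if (times.length < 5) return false; // not enough data to judge

  // Gaps between consecutive requests, their mean and their spread.
  const gaps = times.slice(1).map((t, i) => t - times[i]);
  const mean = gaps.reduce((a, b) => a + b, 0) / gaps.length;
  const variance = gaps.reduce((a, g) => a + (g - mean) ** 2, 0) / gaps.length;

  // Arbitrary thresholds: faster than one request per second on average,
  // with almost metronome-regular spacing, is treated as suspicious.
  return mean < 1000 && Math.sqrt(variance) < 100;
}
```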

sisve
  • Still, sites like Google have tracking logic to figure out whether a human is involved in the site usage! I just want to know that logic so that we can prevent these tools at least to some extent! – banupriya Aug 20 '10 at 06:31

Scrapers steal data from your website by parsing URLs and reading your page source. The following steps can be taken to make scraping at least a bit more difficult, if not impossible.

Load data through Ajax requests. This makes the data harder to parse and requires extra effort to discover the URLs that have to be requested.
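One way to read that suggestion, sketched with a hypothetical Express + TypeScript setup (route names and data are made up): serve an almost empty HTML shell and pull the real data in with a follow-up request, so a scraper that only fetches the page HTML gets nothing useful.

```typescript
import express from "express";

const app = express();

// The HTML shell contains no data; the browser fills it in after load.
// A scraper now has to discover and call /api/listings on its own.
app.get("/listings", (_req, res) => {
  res.send(`
    <div id="listings">Loading...</div>
    <script>
      fetch("/api/listings")
        .then((response) => response.json())
        .then((rows) => {
          document.getElementById("listings").textContent =
            rows.map((row) => row.name + ": " + row.price).join(", ");
        });
    </script>`);
});

app.get("/api/listings", (_req, res) => {
  // Hypothetical data; this endpoint is where you would also apply the
  // cookie and rate-limit checks discussed elsewhere in this thread.
  res.json([{ name: "Widget", price: 10 }, { name: "Gadget", price: 25 }]);
});

app.listen(3000);
```

The flip side is that once the scraper finds the JSON endpoint, the data is actually easier to parse than HTML, so this is friction rather than protection.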

Use cookies even for normal pages that do not require any authentication: create a cookie when the user visits the home page, and then require it for all the inner pages. This makes scraping a bit more difficult.
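A minimal sketch of the cookie idea, again assuming Express + TypeScript (cookie name, routes, and the raw header handling are all just illustration):

```typescript
import express, { Request, Response, NextFunction } from "express";
import { randomBytes } from "crypto";

const app = express();

// Hand out a visit cookie on the home page...
app.get("/", (_req: Request, res: Response) => {
  const token = randomBytes(16).toString("hex");
  // Raw Set-Cookie header keeps the sketch dependency-free; res.cookie()
  // plus cookie-parser would be the usual choice in a real app.
  res.setHeader("Set-Cookie", `visit=${token}; HttpOnly; Path=/`);
  res.send("Welcome. Inner pages now expect the visit cookie.");
});

// ...and refuse inner pages to clients that never picked one up.
app.use("/inner", (req: Request, res: Response, next: NextFunction) => {
  const hasVisitCookie = (req.headers.cookie ?? "").includes("visit=");
  if (!hasVisitCookie) {
    // A scraper that requests inner URLs directly, without first loading the
    // home page and replaying cookies, gets bounced back here.
    res.redirect("/");
    return;
  }
  next();
});

app.get("/inner/data", (_req, res) => res.send("Inner page content"));

app.listen(3000);
```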

Serve the markup in an encrypted (really, obfuscated) form and decrypt it at load time with JavaScript. I have seen this on a couple of websites.
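Tools like the one linked in the comments below generally just obfuscate the markup and decode it in the browser. A hand-rolled sketch of that general idea, with Express + TypeScript and made-up route names (and note this is Base64 obfuscation, not real encryption):

```typescript
import express from "express";

const app = express();

// The markup never appears as plain text in the page source; it is
// Base64-encoded on the server and decoded by a tiny script in the browser.
// Any scraper that runs JavaScript, or simply Base64-decodes the payload,
// still gets the content, so treat this as friction only.
const PROTECTED_MARKUP =
  "<table><tr><td>the data you want to protect</td></tr></table>";

app.get("/protected", (_req, res) => {
  const encoded = Buffer.from(PROTECTED_MARKUP, "utf8").toString("base64");
  res.send(`
    <div id="content"></div>
    <script>
      document.getElementById("content").innerHTML = atob("${encoded}");
    </script>`);
});

app.listen(3000);
```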

Ankit Jaiswal
  • How does creating cookies prevent automated tools? Whether a tool or a human visits the page, a cookie will be created anyway, right? Please provide the JavaScript code for displaying the encrypted content and decrypting it at load time. On my site I use Base64 to encrypt the password. Shall I encrypt query strings as well? – banupriya Aug 19 '10 at 05:56
  • I did not say that creating cookies prevents automated tools; it just makes writing scraping tools more difficult and requires extra effort. – Ankit Jaiswal Aug 19 '10 at 06:31
  • See here for the encrypted HTML: http://www.iwebtool.com/html_encrypter. A similar thing can be implemented on your site as well. – Ankit Jaiswal Aug 19 '10 at 06:37

I guess the only good solution is to limit the rate at which the data can be accessed. It may not completely prevent scraping, but at least you can limit how fast automated scraping tools work, hopefully to below a level that discourages scraping the data at all.
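A very small fixed-window limiter as a sketch, assuming Express + TypeScript and made-up numbers; a production site would more likely use an existing rate-limiting module or do this at the proxy or load-balancer level.

```typescript
import express, { Request, Response, NextFunction } from "express";

const app = express();

// Fixed-window rate limit: at most MAX_REQUESTS per client IP per window.
// The numbers are arbitrary; tune them so real readers never hit the limit.
const WINDOW_MS = 60_000;
const MAX_REQUESTS = 60;

const windows = new Map<string, { start: number; count: number }>();

app.use((req: Request, res: Response, next: NextFunction) => {
  const key = req.ip ?? "unknown";
  const now = Date.now();
  let current = windows.get(key);

  if (!current || now - current.start >= WINDOW_MS) {
    current = { start: now, count: 0 };
    windows.set(key, current);
  }

  current.count += 1;
  if (current.count > MAX_REQUESTS) {
    res.status(429).send("Too many requests; please slow down.");
    return;
  }
  next();
});

app.get("/", (_req, res) => res.send("Page content"));

app.listen(3000);
```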

teukkam