1

I am using PHP to write the server-side code for my website. What is the best way to prevent someone from scraping my data?

For example, what if someone uses `file_get_contents()` on my pages, fetches my login form in an iframe element, or captures the data entered into the login form? How can I prevent such things?

I am using PHP 5.4.7, MySQL, HTML, and CSS.

shubhraj
  • There is no safe way. You might try to add some JS that checks cookies, screen resolution, referrers etc., but none of those are 100% reliable. – Peon Dec 05 '13 at 07:34
  • There's very little, if anything, you can do to prevent that. What's your *real* goal? – rath Dec 05 '13 at 07:34
  • A very complicated matter, and most of the time not worth your while. – Kiruse Dec 05 '13 at 07:34
  • Even `youtube` is not able to prevent their videos from being grabbed by websites like `keepvid.com` etc. – Shankar Narayana Damodaran Dec 05 '13 at 07:36
  • If your site publicly displays data, anyone with enough determination can get it in an automated way. A scraper is doing nothing different from a regular user. – deceze Dec 05 '13 at 07:36
  • What are you afraid of exactly? IFrames or robots? – Mohebifar Dec 05 '13 at 07:43
  • @rath My login details on the website are clearly displayed using the Live HTTP Headers add-on in Firefox or Chrome, and I also have a product ID to be entered; I don't want that to be displayed on other websites through iframes. – shubhraj Dec 05 '13 at 08:00
  • @Mohebifar Iframes, as I have just commented above. – shubhraj Dec 05 '13 at 08:02
  • You must use JavaScript. Check if `window != top`; if so, your website is shown in an iframe, and you can do what you want here. – Mohebifar Dec 05 '13 at 08:06
  • @shubhraj: I just want to make sure I understand. www.scraper.com has an iframe in which they display your website www.tooawesome.com. Are you worried that when you log in to www.tooawesome.com, **your** login details will be displayed in the www.scraper.com iframe? This is not possible (I don't think :/). www.scraper.com would have to create an account and log in, and then **their** login information would be displayed in the iframe. Is this what you mean? –  Dec 05 '13 at 08:50
  • Also, `file_get_contents()` does not get your PHP file. It accesses the file and gets whatever PHP and the server give it. If you have a `defined('_SOMECONSTANT') or die('Buzz off!')` (see my answer below) then all the scraper will "get" is "Buzz off!" if they access the file directly. _SOMECONSTANT will only be defined if the file is accessed properly through, for example, your index.php (or wherever you define _SOMECONSTANT). –  Dec 05 '13 at 15:05
  • @HighPriestessofTheTech I will try your method. Actually, I was testing my website using a PHP script with file_get_contents() to check if data could be fetched, and it was getting fetched. – shubhraj Dec 06 '13 at 05:04
  • @Mohebifar I don't want it to be displayed in an iframe on someone else's website. OK, I will try your method using JavaScript, but what if JS is disabled? – shubhraj Dec 06 '13 at 05:06
  • @shubhraj The best way to see file_get_contents() is as a type of browser - a secret ninja PHP browser ^_^. However you see it in your browser is what file_get_contents() sees. It will see a txt file just like your browser does. It **SHOULDN'T** see anything from your PHP script **unless the correct conditions are met**. If you are getting unwanted data showing, please post a skeleton of your code in the question so that we can have a look :) –  Dec 06 '13 at 06:43

4 Answers

6

I think that being a web developer these days is terrifying, and there may be a temptation to go into "overkill" when it comes to web security. As the other answers have mentioned, it is impossible to stop automated scraping, and it shouldn't worry you if you follow these guidelines:

  • It is great that you are considering website security. Never change.

  • Never send anything from the server you don't want the user to see. If the user is not authorised to see it, don't send it. Don't "hide" important bits and pieces in jQuery.data() or data-attributes. Don't squirrel things away in obfuscated JavaScript. Don't use techniques to hide data on the page until the user logs in, etc, etc.

    Everything - everything - is visible if it leaves the server.

  • If you have content you want to protect from "content farm" scraping, use email-verified user registration (including some form of GOOD reCaptcha to confound most of the bots).

  • Protect your server!!! As best you can, make sure you don't leave any common exploits open. Read this -> http://owasp.org/index.php/Category:How_To <- Yes. All of it ;)

  • Prevent direct access to your files. The more traditional approach is `defined('_SOMECONSTANT') or die('No peeking, hacker!');` at the top of your PHP document. If the file is not accessed through the proper channels, nothing important will be sent from the server.

    You can also meddle with your .htaccess or go large and in charge.
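A minimal sketch of that constant guard, with both halves shown in one listing for brevity. The file names `index.php` and `secret.php` are just placeholders; in a real project the `define()` lives in your entry point and the guard line sits at the top of every included file:

```php
<?php
// index.php (hypothetical entry point): define the guard constant
// before including any internal file.
define('_SOMECONSTANT', 1);

// --- contents of an included file such as secret.php ---
// When secret.php is requested directly, _SOMECONSTANT is never defined,
// so execution stops at die() and a scraper only ever sees that message.
defined('_SOMECONSTANT') or die('No peeking, hacker!');

echo "Sensitive markup, only reachable via index.php\n";
```

Because `file_get_contents()` (or any browser) only receives what PHP outputs, a direct request to the guarded file yields nothing but the `die()` string.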

Are you perhaps worried about cross site scripting (XSS)?

If you are worried about data being intercepted when the user enters login information, you can implement double verification (as Facebook does) or use SSL.

It really all boils down to what your site will do. If it is a run-of-the-mill site, cover the basics in the bullet points and hope for the best ;) If it is something sensitive like a banking site... well... don't do a banking site just yet :P


Just as an aside: I never touch credit card numbers and such. Any website I develop politely hands off, via an API, to a company with insurance and fleets of staff dedicated to security (not just little old me and my shattered nerves).

0

No, there is no way to make sure of this. You can implement some JavaScript functions that try to prevent it, but if the client simply deactivates JS (or a scraper just ignores it), you can't prevent it.
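To illustrate why such checks cannot stop a scraper, here is a hedged PHP sketch of what a scraper might do. `file_get_contents()` never executes JavaScript, and any headers your checks might inspect can be forged; the URL and header values below are made-up placeholders, not real endpoints:

```php
<?php
// Build a request that looks like a normal browser: spoofed User-Agent
// and Referer headers. JavaScript-based defences never run, because
// file_get_contents() does not execute JavaScript at all.
$context = stream_context_create([
    'http' => [
        'method' => 'GET',
        'header' => "User-Agent: Mozilla/5.0 (compatible; LooksLikeABrowser)\r\n"
                  . "Referer: http://www.example.com/\r\n",
    ],
]);

// '@' suppresses the warning if the host happens to be unreachable.
$html = @file_get_contents('http://www.example.com/', false, $context);

if ($html !== false) {
    // The scraper now has the raw HTML, exactly as a browser would receive it.
    echo substr($html, 0, 100);
}
```

Referer or User-Agent checks on the server are therefore only a speed bump, never a wall.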

Maarkoize
0

It is really hard to prevent this. I have found a similar discussion here. It will answer most of your queries, but if you want even stronger protection then sophisticated programs and services like ScrapeSentry and Distil would be needed.

-1

Using JavaScript or PHP, you can only reduce data scraping; you can't stop it. The browser can read the HTML data, so any user can view your page source and get it. You can disable key events, but you still can't stop the scraping.

Nimantha
  • I have used JS (check this link, which encrypts the data using sha1), but what if JS is disabled? http://area51.phpbb.com/phpBB/viewtopic.php?f=108&t=33024 – shubhraj Dec 06 '13 at 04:57