
I'm using a script, written in PHP and jQuery, that scrapes a static website:

<?php
// proxy.php — fetch the contents of the URL passed in ?site= and echo it back
if(isset($_GET['site'])){
  $f = fopen($_GET['site'], 'r');   // open the requested resource for reading
  $html = '';
  while (!feof($f)) {
    $html .= fread($f, 24000);      // read in 24 KB chunks until end of file
  }
  fclose($f);
  echo $html;
}
?>

The jQuery part:

$(function(){
   var site = $('input').val();   // read the URL the visitor typed into the input field

   $.get('proxy.php', { site:site }, function(data){

      $('#myDiv').append(data);

   }, 'html');

});

As you can see, the website that needs to be scraped has to be the value of the input field. I want to give my visitors the ability to set their own website to be scraped.

The problem is that I can't figure out how to secure the PHP part. As I understand it, the input value is a big security risk because anything can be sent with it. I've already experienced slow performance and several 'PC crashes' while working with this code. I'm not sure if the crashes are related, but they only happen when I work on the code. Anyway, I would really like to know how to validate the value (from the input) sent to my server; only REAL URLs should be allowed. I googled for days but I can't figure it out (new at PHP).

P.S. If you spot any other security risks, please let me know.

Youss
  • What exactly are you hoping to achieve? What do you want to prevent? – Madara's Ghost Mar 08 '13 at 11:35
  • 2
    If you're allowing people to scrape arbitrary URLs, adding any kind of sensible security is going to be hell. You'd have to protect against flooding, timeouts, harmful responses (e.g. too big ones).... Why are you offering this as a public service in the first place? – Pekka Mar 08 '13 at 11:35
  • @Pekka Hi, I only need to validate the URL, I think that will solve the biggest security risk. Of course I can't secure it 100% – Youss Mar 08 '13 at 11:37
  • @Youss an invalid URL will just time out. That seems like the *smallest* security risk to me to be honest :) you won't have any luck trying to validate a URL anyway. Whether a URL is actually real, you will find out when you make a request to it. – Pekka Mar 08 '13 at 11:39
  • @Pekka Are you sure there are no 'universal basic security settings' out there that will at least cover the main issues? – Youss Mar 08 '13 at 11:45
  • @Youss nope. The only universal basic security here is not to offer this in the first place. You can use `parse_url()` to see whether it's a valid URL at all, but that won't really help you in terms of security. – Pekka Mar 08 '13 at 11:47
  • @Pekka Isn't this pretty much the 'universal basic security settings' I was talking about: http://www.w3schools.com/php/php_filter.asp – Youss Mar 08 '13 at 12:23
  • @Youss nope. they are for cases where, say, you have a form, and a form field that should contain only numbers. Or only a certain amount of characters. And so on. They have nothing to do with scraping arbitrary web pages. – Pekka Mar 08 '13 at 12:26
  • @Pekka Well my question was also about validating URLs. I mean: the only thing that a user can make use of in this case is a single input. If the user enters something other than a URL (say malware code) nothing happens. If the user enters a real URL then the website gets scraped. It doesn't matter what the website is about, it's not about the website. The user can have the website in a browser and he should protect himself using antivirus software, which is beyond my business model – Youss Mar 08 '13 at 12:33
  • @Youss there's plenty of bad things that one can do to your server through this without having to execute a virus (see the list above). Validating URLs beforehand will do absolutely zero for security. A URL like `http://exa1mple.com/dfdafasdfdsaf` is perfectly valid, yet it will time out when you make a request to it. – Pekka Mar 08 '13 at 12:35
  • @Pekka Doesn't that go for ANY PHP code ever written in this world which includes forms, inputs etc.? How do your comments specifically apply only to screen-scraping code? (Please stick around :) I'm just trying to understand, as it could be very important to my business) – Youss Mar 08 '13 at 12:58
  • @Youss you're partly right - any PHP page could be hit with flood attacks for example, that's true. However, forms inputs and such you can sanitize properly. Allowing requests to arbitrary URLs brings additional danger. For example, I could use your proxy script to stream illegal content from a remote server, leaving your IP the culprit if anything ever goes wrong. I could use it to make requests to huge resources, costing you resources and traffic. I could start a hundred requests to invalid resources, which might bring your server down. – Pekka Mar 08 '13 at 13:05
  • It's possible - even likely - that nothing bad ever happens to you because no one discovers the possibilities, but you *are* leaving an unattended open door – Pekka Mar 08 '13 at 13:06
  • @Pekka Of course there is a lot more to my code, I'm not showing everything. For instance, I'm aware of overload requests, that's why I will set a limit on requests per second. I'm still in the process of building my code (one step at a time...) Another example: if someone is making illegal requests to company X and company X contacts me about this, I will immediately cut this user off (by IP address) or whatever. Anyway, I understand your point of view now, thanks (: – Youss Mar 08 '13 at 13:10
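
To make the discussion above a bit more concrete, here is a rough sketch (my own, not from the comments) of the kind of basic safeguards mentioned: a `parse_url()` sanity check, a request timeout, and a cap on how much data is read. As Pekka points out, this does not make an open proxy safe; it only blunts the most obvious problems. The 5-second timeout and 1 MB cap are arbitrary example values.

<?php
// Sketch only: parse_url() sanity check, timeout, and size cap for the
// fopen-based proxy from the question.
if (isset($_GET['site'])) {
  $site  = $_GET['site'];
  $parts = parse_url($site);

  // Require a host and an http/https scheme; reject local paths, ftp://, etc.
  if ($parts === false || empty($parts['host']) || !isset($parts['scheme']) ||
      !in_array(strtolower($parts['scheme']), array('http', 'https'), true)) {
    exit('Invalid URL');
  }

  // Stop waiting after 5 seconds instead of hanging on slow or dead hosts.
  $context = stream_context_create(array('http' => array('timeout' => 5)));
  $f = @fopen($site, 'r', false, $context);
  if ($f === false) {
    exit('Could not open URL');
  }

  $html    = '';
  $maxSize = 1024 * 1024; // stop after 1 MB so huge responses cannot eat memory
  while (!feof($f) && strlen($html) < $maxSize) {
    $html .= fread($f, 24000);
  }
  fclose($f);
  echo $html;
}
?>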

2 Answers


I think your main security issue is that you're using fopen to read the content of the URL. If the user wants to read a file on your system, they only have to send the path to that file, and if the script has enough permissions they'll be able to access the contents of your hard drive.

I would recommend using other methods like cURL, or at the very least validating the user input to make sure that it's a valid URL; for this, I would check out some regular expressions.
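
For illustration, a minimal cURL-based version of the proxy might look something like this (a sketch, assuming the cURL extension is available; the timeout and redirect limits are example values). `CURLOPT_PROTOCOLS` restricts requests to HTTP/HTTPS, so local `file://` paths are rejected:

<?php
if (isset($_GET['site'])) {
  $ch = curl_init($_GET['site']);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body instead of printing it directly
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow redirects...
  curl_setopt($ch, CURLOPT_MAXREDIRS, 3);           // ...but not forever
  curl_setopt($ch, CURLOPT_TIMEOUT, 10);            // give up after 10 seconds
  curl_setopt($ch, CURLOPT_PROTOCOLS, CURLPROTO_HTTP | CURLPROTO_HTTPS);
  curl_setopt($ch, CURLOPT_REDIR_PROTOCOLS, CURLPROTO_HTTP | CURLPROTO_HTTPS);

  $html = curl_exec($ch);
  curl_close($ch);

  echo ($html === false) ? 'Request failed' : $html;
}
?>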

Good luck with your code!

Edit on validation

Here is a little example of what I meant by validation.

<?php
if(isset($_GET['site'])){
  if(validURL($_GET['site'])) {
     $f = fopen($_GET['site'], 'r');
     $html = '';
     while (!feof($f)) {
       $html .= fread($f, 24000);
     }
     fclose($f);
     echo $html;
  } else {
     echo "Invalid URL, please enter a valid web url (i.e: http://www.google.com)";
  }
}

function validURL($url){ 
   //here goes your validation code, returns true if the url is valid
}
?>

But if you're too new to understand this, I would suggest going for simpler examples, since this is pretty basic logic.
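
If regular expressions feel like overkill, one common way to fill in that `validURL()` placeholder is PHP's built-in `filter_var()` with `FILTER_VALIDATE_URL`; the extra scheme check below is my own addition, since `FILTER_VALIDATE_URL` also accepts schemes such as ftp://. As discussed in the comments on the question, this only confirms the URL is well-formed, not that it is safe or reachable.

<?php
// Sketch of the validURL() placeholder: true only for well-formed http(s) URLs.
function validURL($url) {
  if (filter_var($url, FILTER_VALIDATE_URL) === false) {
    return false;
  }
  // filter_var() accepts other schemes too, so require http or https explicitly.
  $scheme = parse_url($url, PHP_URL_SCHEME);
  return in_array(strtolower($scheme), array('http', 'https'), true);
}
?>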

Deleteman
  • Sorry, I'm new at PHP... doesn't `function validURL($url)` have to be at the top of the code rather than below it? – Youss Mar 08 '13 at 11:50
  • @Youss No it doesn't, but I would write the function in another file and include that file on the first line of your script, to keep things cleaner. – Deleteman Mar 08 '13 at 11:56

It's sad that you could not find anything on the internet about this topic; it's a common thing. Please refer to the links below. They may be of help.

PHP validate input alphanumeric plus a few symbols

http://phpmaster.com/input-validation-using-filter-functions/
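
As a rough illustration of the filter-function approach those links cover (my own sketch, not taken from the linked pages):

<?php
// Read and validate the 'site' query parameter in one step.
// filter_input() returns false if the value is not a valid URL,
// and null if the parameter is missing entirely.
$site = filter_input(INPUT_GET, 'site', FILTER_VALIDATE_URL);

if ($site === false || $site === null) {
  exit('Please supply a valid URL, e.g. http://www.example.com');
}

// $site is now at least a well-formed URL and can be handed to the scraping code.
?>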

GBRocks
  • I did upvote, you had +1... Someone downvoted you... I'm clicking the +1 but it doesn't work; I get the following message: "You last voted on this answer 32 mins ago. Your vote is now locked in unless this answer is edited." – Youss Mar 08 '13 at 12:13
  • These filters have nothing to do with scraping external web sites. – Pekka Mar 08 '13 at 12:26