detect multiple urls in this web crawler

Question

I have written a web crawler in PHP and i gave it eBay to crawl. It grabs all the links in the given web page, but it sometimes can give multiple urls of the same link. It stresses me in my database and i don't know how to tweak the code.

   <?php

 session_start();

 $domain = "www.ebay.com";

  if(empty($_SESSION['page']))
  {
  $original_file = file_get_contents("http://" . $domain . "/");

 $_SESSION['i'] = 0;

  $connect = mysql_connect("xxxxxx", "xxxxxxxxxx", "xxxxxxxxxxxx");

  if (!$connect)
  {
  die("MySQL could not connect!");
  }

 $DB = mysql_select_db('xxxxxxxxxxxxx');

 if(!$DB)
 {
 die("MySQL could not select Database!");
 }
 }
 if(isset($_SESSION['page']))
 {

 $connect = mysql_connect("xxxxxxxxxxxxx", "xxxxxxxxxxxxx", "xxxxxxxxxxxx");

 if (!$connect)
 { 
 die("MySQL could not connect!");
 }

 $DB = mysql_select_db('xxxxxxxx');

  if(!$DB)
  {
  die("MySQL could not select Database!");
  }
  $PAGE = $_SESSION['page'];
  $original_file = file_get_contents("$PAGE");
   }

  $stripped_file = strip_tags($original_file, "<a>");
  preg_match_all("/<a(?:[^>]*)href=\"([^\"]*)\"(?:[^>]*)>(?:[^<]*)<\/a>/is",   $stripped_file, $matches);

  foreach($matches[1] as $key => $value)
   {

  if(strpos($value,"http://") != 'FALSE' && strpos($value,"https://") != 'FALSE')
  {
  $New_URL = "http://" . $domain . $value;
  }
  else
  {
  $New_URL = $value;
   } 
  $New_URL = addslashes($New_URL);
  $Check = mysql_query("SELECT * FROM pages WHERE url='$New_URL'");
  $Num = mysql_num_rows($Check);

  if($Num == 0)
  {
  mysql_query("INSERT INTO pages (url)
  VALUES ('$New_URL')");

  $_SESSION['i']++;

  echo $_SESSION['i'] . "";
  }
  echo mysql_error();
  }

  $RandQuery = mysql_query("SELECT DISTINCT * FROM pages ORDER BY RAND() LIMIT 0,1");
  $RandReturn = mysql_num_rows($RandQuery);
  while($row1 = mysql_fetch_assoc($RandQuery))
  {
  $_SESSION['page'] = $row1['url'];
  }
  echo $RandReturn;
  echo $_SESSION['page'];
  mysql_close();

   ?>

Do you have an index on field "url" in table "pages" in your mysql database? I assume you used a varchar(255) for the url? — Green Black, Aug 17 '12 at 19:17
And is there an index on "url"? If you have an index, your database will thank you, and be less stressed. — Green Black, Aug 17 '12 at 21:23
Please, don't use `mysql_*` functions for new code. They are no longer maintained and the community has begun the [deprecation process](http://goo.gl/KJveJ). See the [**red box**](http://goo.gl/GPmFd)? Instead you should learn about [prepared statements](http://goo.gl/vn8zQ) and use either [PDO](http://php.net/pdo) or [MySQLi](http://php.net/mysqli). If you can't decide, [this article](http://goo.gl/3gqF9) will help to choose. If you care to learn, [here is a good PDO tutorial](http://goo.gl/vFWnC). — PeeHaa, Aug 20 '12 at 16:30
Don't try to parse HTML with regex use a HTML parser: http://stackoverflow.com/questions/3577641/how-to-parse-and-process-html-with-php — PeeHaa, Aug 20 '12 at 16:30

Shlomi Hassid · Answer 1 · 2013-03-28T18:31:01.407

First you have a slight problem with you link scraper:

You are using,

            if(strpos($value,"http://") != 'FALSE' && strpos($value,"https://") != 'FALSE')
            {
                $New_URL = "http://" . $domain . $value;
            }
            else
            {
                $New_URL = $value;
            }

After striping all the tags.

The problem is that if the link HREF is like:

<a href='#' ...> or <a href='javascript:func()'> or <a href='img...'> etc...

It will prepare you an invalid URL that you don't need, you should strpos() or reg_match() for this unique cases (and some others) to escape them.

Also you need to consider escaping URL's that links to files such as: jpg, png, avi, wmv, zip etc...

Now for your question:

You need first to save all the URL's of the target page in an array, after doing that, you need to dump all the duplicate values in this array - that will minimize the amount of time the SQL query will consume...

A quick test using www.ebay.com:

before cleaning duplicate URL's: 196.
after cleaning: 120.

Now use:

SELECT EXISTS(SELECT 1 FROM table1 WHERE ...)

To check if the URL already exists in you data-base... its faster and more reliable.

Watch your code with my changes:

    $stripped_file = strip_tags($original_file, "<a>");
preg_match_all("/<a(?:[^>]*)href=\"([^\"]*)\"(?:[^>]*)>(?:[^<]*)<\/a>/is",  $stripped_file, $matches);
$file_arr = array('#\.jpg$#i', '#\.png$#i', '#\.mpeg$#i', '#\.bmp$#i', '#\.gif$#i', '#\.wmv$#i', '#\.mp3$#i', '#\.avi$#i'); //add files to avoid them.
$avoid = 0; //check if it is a url that links to a file. [0-no, 1-yes]
foreach($matches[1] as $key => $value)
    {
        $value = preg_replace('/\#.*/i', '', $value); //removes pages position.
        if(strpos($value,"http://") != 'FALSE' && 
           strpos($value,"https://") != 'FALSE' && 
           strpos($value,"javascript") != 'FALSE' &&
           strpos($value,"javascript:func") != 'FALSE' &&
           $value != '')
            {
                foreach($file_arr as $val_reg) { preg_match($val_reg, $value, $res); if (isset($res[0])) { $avoid=1;  break 1; }  } //check all the file conditions
                $value = preg_replace('#\/$#i', '', $value) //force '/' at the end of the URL's
                if ($avoid==0) { $New_URL[$key] = "http://" . $domain . $value . "/"; }
            }
            else
            {
             if(strpos($value,"javascript") != 'FALSE' &&
                strpos($value,"javascript:func") != 'FALSE' &&
                $value != '')
                {
                foreach($file_arr as $val_reg) { preg_match($val_reg, $value, $res); if (isset($res[0])) { $avoid=1;  break 1; }  }//check all the file conditions
                $value = preg_replace('#\/$#i', '', $value) //force '/' at the end of the URL's
                if ($avoid==0) { $New_URL[$key] = $value . "/"; }
                }
            }
    }

    //check for duplicate before storing the URL:
    foreach($New_URL as $check)
    {
      $check = mysql_real_escape_string($check);
      $Check_ex = "SELECT EXISTS (SELECT 1 FROM pages WHERE url='$check' LIMIT 1)"; // EXISTS will RETURN 1 if exists ...
        if (@mysql_num_rows(mysql_query($Check_ex))!=1) {
                                                            //Insert your query here......
                                                        } 
                                                        else 
                                                        {
                                                            //Dont store your query......
                                                        }
    }

Not the cleanest code but it should work...

detect multiple urls in this web crawler

1 Answers1