0

I have two tables "product_title" having product titles and "product_list_url" having url of product list pages. I want to fetch title from product title table and fetch and go to each url one by one and search for same title on page and if found, extract some data and save into Database.

I want to check each product on each url. but could not get required result.

here is my code..

<?php
set_time_limit(0);
$dbhost = "localhost";
$dbuser = "root";
$dbpass = "";
$dbname = "crawl";
$conn = mysql_connect($dbhost, $dbuser, $dbpass) or die ("Error connecting to database");
mysql_select_db($dbname, $conn);
$q1 = mysql_query("SELECT * FROM `product-title`");
$q2 = mysql_query("SELECT * FROM `product-list-url`");
while($res1 = mysql_fetch_assoc($q1)){
    $product_title = $res1['title'];
    while($res2 = mysql_fetch_assoc($q2)){
        $url = $res2['url'];
        $html = file_get_contents($url);        
        $doc = new DOMDocument(); 
        @$doc->loadHTML($html);     
        $xpath = new DOMXPath($doc);
        $found = $xpath->evaluate("boolean(//span[contains(text(), '$product_title')])");
        if($found == false){
            echo "Not Found";       
        }

        else{ 
            $elements = $xpath->evaluate("//span[contains(text(), '$product_title' )]/following-sibling::div/span[@class='list_sale_price']");
            if (!is_null($elements)) {
                foreach ($elements as $element) {
                    $nodes = $element->childNodes;
                    foreach ($nodes as $node) {
                        $price = $node->nodeValue;
                        $price1 = preg_replace('/[^0-9-.]/','',$price);
                        $date = date('y-m-d');
                        mysql_query("INSERT INTO `prices` (`ptitle`, `price`, `date`) VALUES ('$product_title', '$price1', '$date')") or die(mysql_error());                    
                    }
                }
            }   
        }
    }
} 

?>

it insert only one record for 1st title.

Dev001
  • 276
  • 1
  • 7
  • 25
  • Your query will take the first record from product-titles and search all the values from product-list-url. However it does not reset the result sets pointer back to the first record, so for every other record on product-titles the fetch of a record on product-list-url will not return any records. Try adding `mysql_data_seek($q2, 0);` after the loop around product-list-url to reset the pointer back to the first record – Kickstart May 21 '14 at 09:21
  • That said, it would be a lot faster to read all the product-title records into memory, the get each record from product-list-url once and check the contents against each stored record. – Kickstart May 21 '14 at 09:23
  • it displays error "The connection has timed out The server at localhost is taking too long to respond." i have 4 urls and 10 product titles in my db. – Dev001 May 21 '14 at 09:54
  • That it is slow doesn't surprise me (it is doing 40 grabs of XML documents and processing them into a readable format), but surprised it is timing out. How large are the XML documents? What is the time limit for the script? Are the urls external and could you post the test data so we can investigate a bit more? – Kickstart May 21 '14 at 10:15
  • i am not using XML documents. I am working on external urls that uses html i have set the set_time_limit(0); – Dev001 May 22 '14 at 14:57
  • How large are the documents, and how complex? I am not experienced with DOMDocument / DOMXPath, but with experience of working with large XML docs (similar processing internally I expect) these can be slow. 40 slow documents could take ages (hence my 2nd suggestion below - which reduces it to only having to process each document once rather than having to process each one 10 times). Try running it as a CRON job and outputting something so you can see progress (eg, maybe echo out the sql insert statement each time) – Kickstart May 22 '14 at 15:04
  • As i'm using product urls and each url have 50 to 80 product items on which searching is applied. this code is only for 3 product list urls. – Dev001 May 23 '14 at 06:24
  • but actually i need to crawl all the pages... – Dev001 May 23 '14 at 06:25
  • But at the moment you grab the same document multiple times. For every product title you retrieve and process all the urls. It would be faster to retrieve each url once, and once retrieved you check it for all the titles. – Kickstart May 23 '14 at 08:08

2 Answers2

0

First of all, The use of mysql_ is bad since it'll be deprecated. On topic, I think the answer you're looking for JOIN. Your table product-title must have some sort of unique ID to each products right? I am assuming that your product-list-url table also have a matching ID with the product in the first table. Since I don't have any table structure I will give you examples.

SELECT t1.title
     , t2.url
FROM `product-title` t1
JOIN `product-list-url` t2 ON t1.id = t2.id
Community
  • 1
  • 1
John Guan
  • 744
  • 2
  • 11
  • 26
0

Just resetting the pointer might allow you to process records after the first product-title. Not tested, nor checked your document processing.

<?php
set_time_limit(0);
$dbhost = "localhost";
$dbuser = "root";
$dbpass = "";
$dbname = "crawl";
$conn = mysql_connect($dbhost, $dbuser, $dbpass) or die ("Error connecting to database");
mysql_select_db($dbname, $conn);
$q1 = mysql_query("SELECT * FROM `product-title`");
$q2 = mysql_query("SELECT * FROM `product-list-url`");
while($res1 = mysql_fetch_assoc($q1))
{
    $product_title = $res1['title'];
    while($res2 = mysql_fetch_assoc($q2))
    {
        $url = $res2['url'];
        $html = file_get_contents($url);        
        $doc = new DOMDocument(); 
        @$doc->loadHTML($html);     
        $xpath = new DOMXPath($doc);
        $found = $xpath->evaluate("boolean(//span[contains(text(), '$product_title')])");
        if($found == false)
        {
            echo "Not Found";       
        }

        else
        { 
            $elements = $xpath->evaluate("//span[contains(text(), '$product_title' )]/following-sibling::div/span[@class='list_sale_price']");
            if (!is_null($elements)) {
                foreach ($elements as $element) {
                    $nodes = $element->childNodes;
                    foreach ($nodes as $node) {
                        $price = $node->nodeValue;
                        $price1 = preg_replace('/[^0-9-.]/','',$price);
                        $date = date('y-m-d');
                        mysql_query("INSERT INTO `prices` (`ptitle`, `price`, `date`) VALUES ('$product_title', '$price1', '$date')") or die(mysql_error());                    
                    }
                }
            }   
        }
    }
    mysql_data_seek($q2, 0);
} 

?>

However it is probably more efficient to read the list of titles into memory, then process each file once, checking the contents many times. Once for each title.

Something like this:-

<?php
set_time_limit(0);
$dbhost = "localhost";
$dbuser = "root";
$dbpass = "";
$dbname = "crawl";
$conn = mysql_connect($dbhost, $dbuser, $dbpass) or die ("Error connecting to database");
mysql_select_db($dbname, $conn);
$product_titles = array();
$q1 = mysql_query("SELECT * FROM `product-title`");
while($res1 = mysql_fetch_assoc($q1))
{
    $product_titles[] = $res1['title'];
}
$q2 = mysql_query("SELECT * FROM `product-list-url`");

while($res2 = mysql_fetch_assoc($q2))
{
    $url = $res2['url'];
    $html = file_get_contents($url);        
    $doc = new DOMDocument(); 
    @$doc->loadHTML($html);     
    $xpath = new DOMXPath($doc);
    foreach($product_titles as $product_title)
    {
        $found = $xpath->evaluate("boolean(//span[contains(text(), '$product_title')])");
        if($found == false)
        {
            echo "Not Found";       
        }

        else
        { 
            $elements = $xpath->evaluate("//span[contains(text(), '$product_title' )]/following-sibling::div/span[@class='list_sale_price']");
            if (!is_null($elements)) 
            {
                foreach ($elements as $element) 
                {
                    $nodes = $element->childNodes;
                    foreach ($nodes as $node) 
                    {
                        $price = $node->nodeValue;
                        $price1 = preg_replace('/[^0-9-.]/','',$price);
                        $date = date('y-m-d');
                        mysql_query("INSERT INTO `prices` (`ptitle`, `price`, `date`) VALUES ('$product_title', '$price1', '$date')") or die(mysql_error());                    
                    }
                }
            }   
        }
    }
}

?>
Kickstart
  • 21,403
  • 2
  • 21
  • 33