I'm trying to scan prices from THIS page, and I want to use preg_match to extract the prices from this element: <span class="price"><b>519,00&nbsp;€</b></span>. What is the correct preg_match pattern?

This is my extractor script:

<?php
echo  "funziona!";

    if(!$fp = fopen("https://www.google.it/webhp?sourceid=chrome-instant&ion=1&espv=2&es_th=1&ie=UTF-8#tbs=vw:l,mr:1&tbm=shop&q=samsung+galaxy+note+4&tbas=0" ,"r" )) {
        return false;
    } //our fopen is right, so let's go
    $content = "";

    while(!feof($fp)) { //while it is not the last line, we will add the current line to our $content
        $content .= fgets($fp, 1024);
    }
    fclose($fp); //we are done here, don't need the main source anymore
?>

<?php
//our fopen, fgets here

//our magic regex here
preg_match_all('/<span class=\"price">(.*?)<\/span>/s',$content, $prices); //THIS IS PREG_MATCH 
    echo $prices[0][0]."<br />";
?>

I have never used preg_match before; I'm trying to adapt this script.
Thank you.

leofabri
  • What happens with the current code? You don't need to escape the double quote, `\"`. You also want the first index, not the zero index of prices. – chris85 Jul 10 '15 at 18:41
  • This should print prices from webpages, but there are errors. The full code is in this guide http://www.1stwebdesigner.com/php-crawler-tutorial/ – leofabri Jul 10 '15 at 18:43
  • there is no "correct" preg. regexes + html = bad idea. use a DOM parser. – Marc B Jul 10 '15 at 18:43
  • Are there guides on Stack Overflow? I haven't found any :( – leofabri Jul 10 '15 at 18:47
  • OK, I found this: http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php – leofabri Jul 10 '15 at 18:48
  • Thank you for the advice :) – leofabri Jul 10 '15 at 18:48
  • That content is loaded via JavaScript. Check the content you are running the regex against in PHP. Is "price" there? You will have the same issue when trying with a parser; you need the content to be correct first. – chris85 Jul 10 '15 at 18:49
  • And what should I use to get prices? – leofabri Jul 10 '15 at 18:53
  • I'm not sure how it is populated. You'll need to research that and have a way to make the PHP emulate that process. – chris85 Jul 10 '15 at 18:55
  • I could also use this site to find prices. Maybe it's better? http://www.trovaprezzi.it/Fprezzo_tablet_samsung_t805_galaxy_tab_s_10_5_16gb_4g.aspx – leofabri Jul 10 '15 at 18:59
  • That page should work, element is different though. `` – chris85 Jul 10 '15 at 19:08
  • OK, but do I need to use preg_match? I have never tried anything like this before – leofabri Jul 10 '15 at 19:10
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/82988/discussion-between-chris85-and-leofabri). – chris85 Jul 10 '15 at 19:15
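The two regex remarks in the comments above (the unneeded `\"` escape and reading the capture group rather than the full match) can be sketched as follows. This is only an illustration run against the sample markup from the question, not against the live page, which the comments note is populated by JavaScript. The function name `extractPrices` is hypothetical.

```php
<?php
// Hypothetical helper: pulls the text inside <span class="price"><b>...</b></span>.
function extractPrices(string $content): array
{
    // Inside a single-quoted PHP string the double quote needs no backslash.
    // (.*?) is capture group 1: only the text between <b> and </b>.
    preg_match_all('/<span class="price"><b>(.*?)<\/b><\/span>/su', $content, $matches);

    $prices = [];
    // $matches[0] holds the full matches, $matches[1] holds capture group 1.
    foreach ($matches[1] as $raw) {
        // Decode &nbsp; and turn the resulting non-breaking space into a plain space.
        $decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');
        $prices[] = str_replace("\u{00A0}", ' ', $decoded);
    }
    return $prices;
}

// Sample markup copied from the question.
$sample = '<span class="price"><b>519,00&nbsp;€</b></span>';
print_r(extractPrices($sample));
```

The pattern is tied to this exact markup; any extra attribute or whitespace in the real page would make it miss, which is the usual argument for a parser instead.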

2 Answers

You should use a parser, not a regex, to accomplish this task. Here's a sample of how this could be done using the Simple HTML DOM parser.

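include_once 'simple_html_dom.php';
$html = file_get_html('http://www.example.com');
foreach ($html->find('span') as $element) {
    // strpos() returns 0 when the class starts with "price", so compare against false explicitly
    if (strpos((string) $element->class, 'price') !== false) {
        echo $element->innertext . "\n";
    }
}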

This is also a pretty loose check, so you may get back more results than you want: it only checks that the span's class contains the word "price".

http://simplehtmldom.sourceforge.net/manual.htm#section_quickstart

For other approaches, see "How do you parse and process HTML/XML in PHP?"
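As one of those other approaches, a stricter class check could be sketched with PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM. This is a minimal sketch, assuming the markup looks like the sample in the question; the function name `findPrices` and the second span in the stand-in markup are made up for illustration.

```php
<?php
// Hypothetical helper: returns the text of every <span> whose class list contains "price".
function findPrices(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them for imperfect HTML.
    libxml_use_internal_errors(true);
    // The XML prefix is a common trick to make libxml read the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Match "price" as a whole class token, so "old-price-label" is not selected.
    $query = '//span[contains(concat(" ", normalize-space(@class), " "), " price ")]';

    $prices = [];
    foreach ($xpath->query($query) as $span) {
        // textContent has entities decoded; swap the non-breaking space for a plain one.
        $prices[] = trim(str_replace("\u{00A0}", ' ', $span->textContent));
    }
    return $prices;
}

// Stand-in markup; in practice this would be the downloaded page source.
$html = '<div><span class="price"><b>519,00&nbsp;€</b></span>'
      . '<span class="old-price-label">n/a</span></div>';
print_r(findPrices($html));
```

Matching the class as a whole token avoids the "more results than you want" problem of a substring check, at the cost of a longer XPath expression.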

chris85

Have a look at this:

<?php
function getUrl($Url,$Options = array(),&$optOut = array())
{

    $CURL_DEFAULT_SETTINGS  = array
    (
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_AUTOREFERER => true,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_CONNECTTIMEOUT => 10,
        CURLOPT_MAXREDIRS => 10,
        CURLOPT_TIMEOUT => 10,
        CURLOPT_USERAGENT => 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8'
    );

    if (!($ch = curl_init($Url)))
        throw new Exception("Couldn't initialize cURL library",100);

    if (is_array($CURL_DEFAULT_SETTINGS) && count($CURL_DEFAULT_SETTINGS) > 0)
        curl_setopt_array($ch,$CURL_DEFAULT_SETTINGS);

    if (is_array($Options) && count($Options) > 0)
    {
        foreach ($Options as $k => $v)
        {
            curl_setopt($ch,$k,$v);
        }
    }

    $Data = curl_exec($ch);
    $Error = curl_error($ch);

    $optOut['CURLINFO_HEADER_OUT'] = curl_getinfo($ch, CURLINFO_HEADER_OUT );

    curl_close($ch);

    if (!$Data)
    {
        if ($Error)
            throw new Exception($Error);

        return false;
    }

    return $Data;
}

function getPriceFor($query) {
    $data = getUrl('https://www.google.it/search?tbs=vw:l,mr:1&tbm=shop&q='.rawurlencode($query).'&tbas=0&bav=on.2,or.&cad=b&fp=6a24b60e09fe0b18&biw=1196&bih=703&dpr=2&ion=1&espv=2&tch=1&ech=1&psi=byWgVee9A4TNeIXRgLAK.1436558704099.3');
    $data = '['.preg_replace('/\/\*""\*\//msi',',',preg_replace('/\/\*""\*\/[\s]*$/msi','',$data)).']';
    $data = json_decode($data,true);
    preg_match_all('/<div[\s]+class="_OA"><div><b>([^<]+)[\s]*<\/b><\/div><div>([^<]+)<\/div><\/div>/msi',$data[3]['d'],$res);

    $re = array();

    foreach ($res[1] as $k=>$r)
        $re[] = array('price'=>$r,'from'=>$res[2][$k]);

    return $re;
}

print_r(getPriceFor('samsung galaxy note 4'));

That should display something like this:

Array
(
    [0] => Array
        (
            [price] => 515,00 €
            [from] => phoneshopping.it
        )

    [1] => Array
        (
            [price] => 519,00 €
            [from] => Smartyrama
        )

    [2] => Array
        (
            [price] => 519,00 €
            [from] => Smartyrama
        )

    [3] => Array
        (
            [price] => 519,00 €
            [from] => Smartyrama
        )

    [4] => Array
        (
            [price] => 690,45 €
            [from] => Amazon.it - Seller
        )

    [5] => Array
        (
            [price] => 673,99 €
            [from] => da 2 negozi
        )

    [6] => Array
        (
            [price] => 345,00 €
            [from] => da 2 negozi
        )

    [7] => Array
        (
            [price] => 342,00 €
            [from] => Amazon.it - Seller
        )

    [8] => Array
        (
            [price] => 699,99 €
            [from] => ePRICE.it
        )

    [9] => Array
        (
            [price] => 730,00 €
            [from] => in oltre 5 negozi
        )

    [10] => Array
        (
            [price] => 20,00 €
            [from] => Amazon.it - Seller
        )

    [11] => Array
        (
            [price] => 208,99 €
            [from] => eGlobal Central Italia
        )

    [12] => Array
        (
            [price] => 711,00 €
            [from] => in oltre 5 negozi
        )

    [13] => Array
        (
            [price] => 322,99 €
            [from] => eGlobal Central Italia
        )

    [14] => Array
        (
            [price] => 40,09 €
            [from] => da 4 negozi
        )

    [15] => Array
        (
            [price] => 15,99 €
            [from] => acadattatore.com
        )

    [16] => Array
        (
            [price] => 339,99 €
            [from] => ePRICE.it
        )

    [17] => Array
        (
            [price] => 412,90 €
            [from] => da 3 negozi
        )

    [18] => Array
        (
            [price] => 343,33 €
            [from] => Amazon.it - Seller
        )

    [19] => Array
        (
            [price] => 629,00 €
            [from] => BestPriceStore
        )

)
tin
  • Thank you tin and Chris, I really appreciate your support. Tin, when I try your code I get this error: Fatal error: `Uncaught exception 'Exception' with message 'SSL certificate problem: unable to get local issuer certificate' in C:\xampp\htdocs\index.php:41 Stack trace: #0 C:\xampp\htdocs\index.php(50): getUrl('https://www.goo...') #1 C:\xampp\htdocs\index.php(63): getPriceFor('samsung galaxy ...') #2 {main} thrown in C:\xampp\htdocs\index.php on line 41 ` – leofabri Jul 11 '15 at 07:35
  • I see you're using Windows. You're going to have to set an SSL certificate for cURL or use file_get_contents instead of the getUrl function I'm calling. I could give you further instructions in a few hours. – tin Jul 11 '15 at 07:43
  • Oh OK, your program works very well on Ubuntu. Yes, I initially ran your code on a XAMPP machine, but since I want to use it on an Ubuntu-based machine, I don't need to modify the code for Windows. Thank you very much for your support; you have been clear and direct. – leofabri Jul 11 '15 at 07:54
  • One last question: how did you structure the URL for scanning? If I want to find the price of another product, what should I do? – leofabri Jul 11 '15 at 08:27
  • Vote up and mark your question as answered. That's the best way to give thanks. – tin Jul 11 '15 at 09:17
  • Sorry for asking, but if I also want to scan image URLs, titles and seller URLs using preg_match_all, what should the pattern be?  – leofabri Jul 14 '15 at 15:46
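On the earlier question about the URL structure: in the answer's `getPriceFor`, the only part of the URL that depends on the product is the `q` parameter, which is filled with `rawurlencode($query)`. The sketch below shows just that encoding step. The helper name `buildQueryPart` is hypothetical, and the remaining parameters from the answer's URL are deliberately left out because their meaning is not explained in this thread.

```php
<?php
// Hypothetical helper: builds only the product-dependent part of the query string.
function buildQueryPart(string $product): string
{
    // rawurlencode() percent-encodes spaces as %20 and leaves letters, digits, "-", "_", "." and "~" alone.
    return 'q=' . rawurlencode($product);
}

echo buildQueryPart('samsung galaxy note 4'), "\n";  // q=samsung%20galaxy%20note%204
echo buildQueryPart('Galaxy Tab S 10.5'), "\n";      // q=Galaxy%20Tab%20S%2010.5
```

With the answer's code unchanged, searching for another product would then amount to calling `getPriceFor()` with a different string.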