Extract string in HTML page using scraping in PHP ad xpath

Question

I need to scrape this HTML page ...

https://www.sanita.puglia.it/monitorpo/aslfg/monitorps-web/monitorps/monitorPSperASL.do?codNazionale=160115

.... using PHP and XPath to get the value 10 in the green box under tha table named "PO G. TATARELLA-CERIGNOLA".

(NOTE: you could see different value in that page if you try to browse it ... it doesn't matter ..,, it change dinamically .... )

I'm using this PHP code sample to print the value ...

<?php
    ini_set('display_errors', 'On');
    error_reporting(E_ALL);

    $url = 'https://www.sanita.puglia.it/monitorpo/aslfg/monitorps-web/monitorps/monitorPSperASL.do?codNazionale=160115';

    $xpath_for_parsing = '/html/body/div[4]/table/tbody/tr[2]/td[4]/div';

    //#Set CURL parameters: pay attention to the PROXY config !!!!
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
    curl_setopt($ch, CURLOPT_PROXY, '');
    $data = curl_exec($ch);
    curl_close($ch);

    $dom = new DOMDocument();
    @$dom->loadHTML($data);

    $xpath = new DOMXPath($dom);

    $colorWaitingNumber = $xpath->query($xpath_for_parsing);
    $theValue =  'N.D.';
    foreach( $colorWaitingNumber as $node )
    {
      $theValue = $node->nodeValue;
    }

    print $theValue;
?>

In this way I obtain "N.D." as output not "10" as I suppose.

The page source code is the following ...

In my code I'd like don't use "absolute xpath" so, I've tried to use a syntax like (I know that it does not work but I'm quite a newbie with xpath ... )

$xpath_for_parsing = '//*[div="cRiga3 boxtriageS"]';

but the result is always the same.

Any suggestions / example?

score 1 · Answer 1 · answered Dec 02 '17 at 17:49

I think the following ought to help a little - you will need to tweak the XPath query perhaps to target a specific table and thus specific cell contents but the main code seems to work ok. The problem I suspect with the original code was that the url is https which usually requires additional configuration settings when making a curl request. There are settings in the curlrequest function which could be removed, I just copied from another script where I had these set.

Change the path to $cacert to a copy of cacert.pem on your system or to the live version on curl.haxx.se

$url = 'https://www.sanita.puglia.it/monitorpo/aslfg/monitorps-web/monitorps/monitorPSperASL.do?codNazionale=160115';

function _curlrequest( $url=null, $options=null ){
    $cacert='c:/wwwroot/cacert.pem';
    $vbh = fopen('php://temp', 'w+');


    $res=array(
        'response'  =>  null,
        'verbose'   =>  null,
        'info'      =>  array( 'http_code' => 100 ),
        'headers'   =>  null,
        'errors'    =>  null
    );
    if( is_null( $url ) ) return (object)$res;

    session_write_close();

    /* Initialise curl request object */
    $curl=curl_init();
    if( parse_url( $url,PHP_URL_SCHEME )=='https' ){
        curl_setopt( $curl, CURLOPT_SSL_VERIFYPEER, true );
        curl_setopt( $curl, CURLOPT_SSL_VERIFYHOST, 2 );
        curl_setopt( $curl, CURLOPT_CAINFO, $cacert );
    }

    /* Define standard options */
    curl_setopt( $curl, CURLOPT_URL,trim( $url ) );
    curl_setopt( $curl, CURLOPT_AUTOREFERER, true );
    curl_setopt( $curl, CURLOPT_FOLLOWLOCATION, true );
    curl_setopt( $curl, CURLOPT_FAILONERROR, true );
    curl_setopt( $curl, CURLOPT_HEADER, false );
    curl_setopt( $curl, CURLINFO_HEADER_OUT, false );
    curl_setopt( $curl, CURLOPT_RETURNTRANSFER, true );
    curl_setopt( $curl, CURLOPT_BINARYTRANSFER, true );
    curl_setopt( $curl, CURLOPT_CONNECTTIMEOUT, 20 );
    curl_setopt( $curl, CURLOPT_TIMEOUT, 60 );
    curl_setopt( $curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36' );
    curl_setopt( $curl, CURLOPT_MAXREDIRS, 10 );
    curl_setopt( $curl, CURLOPT_ENCODING, '' );

    curl_setopt( $curl,CURLOPT_VERBOSE,true );
    curl_setopt( $curl,CURLOPT_NOPROGRESS,true );
    curl_setopt( $curl,CURLOPT_STDERR,$vbh );

    /* Assign runtime parameters as options */
    if( isset( $options ) && is_array( $options ) ){
        foreach( $options as $param => $value ) curl_setopt( $curl, $param, $value );
    }

    /* Execute the request and store responses */
    $res=(object)array(
        'response'  =>  curl_exec( $curl ),
        'info'      =>  (object)curl_getinfo( $curl ),
        'errors'    =>  curl_error( $curl )
    );
    rewind( $vbh );
    $res->verbose=stream_get_contents( $vbh );
    fclose( $vbh );

    curl_close( $curl );
    return $res;
}

function getdom( $data=false, $debug=false ){
    try{
        if( !$data )throw new Exception('No data passed whilst trying to invoke DOMDocument');
        libxml_use_internal_errors( true );

        $dom = new DOMDocument();
        $dom->validateOnParse=false;
        $dom->standalone=true;
        $dom->strictErrorChecking=false;
        $dom->recover=true;
        $dom->formatOutput=false;
        $dom->loadHTML( $data );

        $errors=libxml_get_errors();
        libxml_clear_errors();

        return !empty( $errors ) && $debug ? $errors : $dom;

    }catch( Exception $e ){
        echo $e->getMessage();
    }
}



$obj=_curlrequest( $url );
if( $obj->info->http_code==200 ){

    $dom=getdom( $obj->response );
    $xp=new DOMXPath( $dom );


    $query='//div[ contains( @class,"cRiga3 boxtriageS" ) ]';
    $col=$xp->query( $query );

    if( !empty( $col ) && $col->length > 0 ){
        foreach( $col as $node )echo $node->nodeValue . '<br />';
    }
}

This outputs

Extract string in HTML page using scraping in PHP ad xpath

1 Answers1

Linked

Related