Scrape HTML Page that redirects to itself using Curl PHP

Question

I'm trying to scrape this page: http://www.asx.com.au/asx/statistics/todayAnns.do

My code can't seem to get the page's full HTML, and it behaves very weirdly.

I've also tried Simple HTML DOM, but nothing works.

    $base = "http://www.asx.com.au/asx/statistics/todayAnns.do";

    $curl = curl_init();
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, false);
    curl_setopt($curl, CURLOPT_HEADER, false);
    curl_setopt($curl, CURLOPT_URL, $base);
    curl_setopt($curl, CURLOPT_REFERER, $base);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    $str = curl_exec($curl);
    curl_close($curl);
    echo htmlspecialchars($str);

This shows mostly JavaScript, and I can't get the actual page content. My goal is to scrape the table in the middle of that page.

score 1 · Accepted Answer · answered Jun 27 '17 at 10:37

If you don't need the most recent data then you can use the cached version of the page from Google.

    <?php

    use Scraper\Scrape\Crawler\Types\GeneralCrawler;
    use Scraper\Scrape\Extractor\Types\MultipleRowExtractor;

    require_once(__DIR__ . '/../vendor/autoload.php');
    date_default_timezone_set('UTC');

    // Create crawler
    $crawler = new GeneralCrawler(
        'http://webcache.googleusercontent.com/search?q=cache:http://www.asx.com.au/asx/statistics/todayAnns.do&num=1&strip=0&vwsrc=0'
    );

    // Setup configuration
    $configuration = new \Scraper\Structure\Configuration();
    $configuration->setTargetXPath('//div[@class="page"]//table');
    $configuration->setRowXPath('.//tr');
    $configuration->setFields(
        [
            new \Scraper\Structure\TextField(
                [
                    'name'  => 'Headline',
                    'xpath' => './/td[3]',
                ]
            ),
            new \Scraper\Structure\TextField(
                [
                    'name'  => 'Published',
                    'xpath' => './/td[1]',
                ]
            ),
            new \Scraper\Structure\TextField(
                [
                    'name'  => 'Pages',
                    'xpath' => './/td[4]',
                ]
            ),
            new \Scraper\Structure\AnchorField(
                [
                    'name'               => 'Link',
                    'xpath'              => './/td[5]/a',
                    'convertRelativeUrl' => false,
                ]
            ),
            new \Scraper\Structure\TextField(
                [
                    'name'  => 'Code',
                    'xpath' => './/text()',
                ]
            ),
        ]
    );

    // Extract data
    $extractor = new MultipleRowExtractor($crawler, $configuration);
    $data = $extractor->extract();
    print_r($data);

I was able to get the following data using the code above.

    Array
    (
        [0] => Array
            (
                [Code] => ASX
                [hash] => 6e16c02b10a10baf739c2613bc87f906
            )

        [1] => Array
            (
                [Headline] => Initial Director's Interest Notice
                [Published] => 10:57 AM
                [Pages] => 1
                [Link] => /asx/statistics/displayAnnouncement.do?display=pdf&idsId=01868833
                [Code] => STO
                [hash] => aa2ea9b1b9b0bc843a4ac41e647319b4
            )

        [2] => Array
            (
                [Headline] => Becoming a substantial holder
                [Published] => 10:53 AM
                [Pages] => 2
                [Link] => /asx/statistics/displayAnnouncement.do?display=pdf&idsId=01868832
                [Code] => AKG
                [hash] => f8ff8dfde597a0fc68284b8957f38758
            )

        [3] => Array
            (
                [Headline] => LBT Investor Conference Call Business Update
                [Published] => 10:53 AM
                [Pages] => 9
                [Link] => /asx/statistics/displayAnnouncement.do?display=pdf&idsId=01868831
                [Code] => LBT
                [hash] => cc78f327f2b421f46036de0fce270a6d
            )

    ...

Disclaimer: I used the https://github.com/rajanrx/php-scrape framework, and I am an author of that library. You can also grab the data with plain cURL, using the XPath expressions listed above. I hope this helps :)
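For the plain cURL route, something along these lines might work. This is only a rough sketch, not the library's implementation: the function names `fetchHtml` and `extractAnnouncements` are made up for illustration, and it assumes the table cells are laid out the way the XPath expressions above imply (1st cell published time, 3rd headline, 4th pages, 5th the PDF link). The `Code` field is left out because its column is not pinned down above.

```php
<?php

/**
 * Fetch a URL with cURL and return the response body.
 * (Hypothetical helper; the name is an assumption.)
 */
function fetchHtml(string $url): string
{
    $curl = curl_init($url);
    curl_setopt_array($curl, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow HTTP redirects
        CURLOPT_MAXREDIRS      => 5,    // but give up on redirect loops
        CURLOPT_TIMEOUT        => 30,
    ]);
    $body = curl_exec($curl);
    if ($body === false) {
        throw new RuntimeException('cURL error: ' . curl_error($curl));
    }
    return $body;
}

/**
 * Pull the announcement rows out of the HTML with DOMXPath.
 * Column positions are assumed from the XPath expressions in this answer.
 */
function extractAnnouncements(string $html): array
{
    if (trim($html) === '') {
        return [];
    }

    $dom = new DOMDocument();
    // Real-world HTML is rarely valid, so keep parser warnings quiet.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $rows = [];
    foreach ($xpath->query('//div[@class="page"]//table//tr') as $row) {
        $cells = $xpath->query('./td', $row);
        if ($cells->length < 5) {
            continue; // header row or a row without the expected cells
        }
        $link = $xpath->query('.//a', $cells->item(4))->item(0);
        $rows[] = [
            'Published' => trim($cells->item(0)->textContent),
            'Headline'  => trim($cells->item(2)->textContent),
            'Pages'     => trim($cells->item(3)->textContent),
            'Link'      => $link !== null ? $link->getAttribute('href') : null,
        ];
    }
    return $rows;
}
```

You would then call it as `print_r(extractAnnouncements(fetchHtml($cacheUrl)));`, where `$cacheUrl` is the Google cache URL from the code above.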

Amazing! That's exactly what I was looking for! – EchO, Jun 27 '17 at 18:29

score 0 · Answer 2 · bhar1red · 2017-06-16T02:53:43.323

cURL only fetches the raw markup of the page; it does not execute JavaScript. This page uses JavaScript to load its data after the initial page load, so you might have to use a headless browser such as PhantomJS or Splash.

This link might help: https://stackoverflow.com/a/20554152/3086531

To fetch the data on the server side, you can drive PhantomJS from PHP: load the page inside PhantomJS, then pass the rendered output back to PHP using the exec command.

This article walks through the process step by step: http://shout.setfive.com/2015/03/30/7817/
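Roughly, the PHP side of that could look like the sketch below. It assumes a `phantomjs` binary is installed and on the PATH, and that you have written a small PhantomJS script (called `render.js` here, a hypothetical name) that prints the rendered HTML. The helper names are also made up for illustration.

```php
<?php

// Assumed contents of the hypothetical render.js:
//   var page = require('webpage').create();
//   var url = require('system').args[1];
//   page.open(url, function (status) {
//       console.log(page.content);
//       phantom.exit(status === 'success' ? 0 : 1);
//   });

/**
 * Build the shell command that runs a PhantomJS script against a URL.
 * Every part is escaped so the URL cannot inject extra shell arguments.
 */
function buildPhantomCommand(string $binary, string $script, string $url): string
{
    return implode(' ', array_map('escapeshellarg', [$binary, $script, $url]));
}

/**
 * Run the script through exec() and return whatever it printed,
 * which should be the HTML after JavaScript has run.
 */
function renderWithPhantom(string $url, string $script, string $binary = 'phantomjs'): string
{
    $lines = [];
    $status = 0;
    exec(buildPhantomCommand($binary, $script, $url), $lines, $status);
    if ($status !== 0) {
        throw new RuntimeException("PhantomJS exited with status $status");
    }
    return implode("\n", $lines);
}
```

Usage would be something like `$html = renderWithPhantom('http://www.asx.com.au/asx/statistics/todayAnns.do', __DIR__ . '/render.js');`, after which the HTML can be parsed with DOMXPath as usual.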

edited Jun 16 '17 at 02:53

answered Jun 16 '17 at 02:03


I was hoping for a PHP library... I need this fetched by the server in real time. – EchO Jun 16 '17 at 02:15
@SilverSkin You can use the PhantomJS library inside PHP. This article might help: http://shout.setfive.com/2015/03/30/7817/ – bhar1red Jun 16 '17 at 02:50
I don't have access to install PhantomJS on the server I'm working on. Is there no other alternative? – EchO Jun 16 '17 at 03:22
