Optimize remote page retrieving and parsing

Question

I'm retrieving a remote page with PHP, extracting a few links from it, then fetching and parsing each linked page.
This takes about 12 seconds, which is far too long, so I need to optimize the code somehow.
My code looks something like this:

$result = get_web_page('THE_WEB_PAGE');

preg_match_all('/<a data\-a=".*" href="(.*)">/', $result['content'], $matches);

$index = 0;

// The pattern has a single capture group, so the hrefs are in $matches[1]
foreach ($matches[1] as $lnk) {
    $result = get_web_page($lnk);

    preg_match('/<span id="tests">(.*)<\/span>/', $result['content'], $match);

    $re[$index]['test'] = $match[1];

    preg_match('/<span id="tests2">(.*)<\/span>/', $result['content'], $match);

    $re[$index]['test2'] = $match[1];

    preg_match('/<span id="tests3">(.*)<\/span>/', $result['content'], $match);

    $re[$index]['test3'] = $match[1];
    ++$index;
}

I have some more preg_match calls inside the loop.
How can I optimize my code?

Edit:

I've changed my code to use XPath instead of regex, and it became much slower.

Edit2:

That's my full code:

<?php
$begin = microtime(TRUE);
$result = get_web_page('WEB_PAGE');

$dom = new DOMDocument();
$dom->loadHTML($result['content']);
$xpath = new DOMXPath($dom);

// Get the links
$matches = $xpath->evaluate('//li[@class = "lasts"]/a[@class = "lnk"]/@href | //li[@class=""]/a[ @class = "lnk"]/@href');
if ($matches === FALSE) {
    echo 'error';
    exit();
}
foreach ($matches as $match) {
    $links[] = 'WEB_PAGE'.$match->value;
}

$index = 0;

// For each link
foreach ($links as $link) {
    echo (string)($index).' loop '.(string)(microtime(TRUE)-$begin).'<br>';
    $result = get_web_page($link);

    $dom = new DOMDocument();
    $dom->loadHTML($result['content']);
    $xpath = new DOMXPath($dom);

    $match = $xpath->evaluate('concat(//span[@id = "header"]/span[@id = "sub_header"]/text(), //span[@id = "header"]/span[@id = "sub_header"]/following-sibling::text()[1])');
    if ($match === FALSE) {
        exit();
    }
    $data[$index]['name'] = $match;

    $matches = $xpath->evaluate('//li[starts-with(@class, "active")]/a/text()');
    if ($matches === FALSE) {
        exit();
    }
    foreach ($matches as $match) {
        $data[$index]['types'][] = $match->data;
    }

    $matches = $xpath->evaluate('//span[@title = "this is a title" and @class = "info"]/text()');
    if ($matches === FALSE) {
        exit();
    }
    foreach ($matches as $match) {
        $data[$index]['info'][] = $match->data;
    }

    $matches = $xpath->evaluate('//span[@title = "this is another title" and @class = "name"]/text()');
    if ($matches === FALSE) {
        exit();
    }
    foreach ($matches as $match) {
        $data[$index]['names'][] = $match->data;
    }

    ++$index;
}

?>

Asking for trouble when using regex to parse HTML. (Refer to Answer by @Tim van Osch) http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Richard Christensen, Aug 04 '16 at 20:36
http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php — AbraCadaver, Aug 04 '16 at 20:46
How are you going to get the expected results while using greedy quantifiers in the first place? — revo, Aug 04 '16 at 21:24
@revo What do you mean? I am getting the expected results... — Lior, Aug 04 '16 at 21:37
Well I doubt unless you put some examples of what `$result['content']` can hold. — revo, Aug 04 '16 at 21:41

Tim van Osch · Answer 1 · 2016-08-04T21:31:09.763

2

Consider using a DOM framework for PHP. This should be way faster.

Use PHP's DOMDocument with xpath queries:
http://php.net/manual/en/class.domdocument.php

See Jan's answer for more explanation.
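
As a rough sketch of what that could look like for the spans in the question: the three preg_match calls can each become one XPath string() lookup. The markup below is a made-up stand-in for $result['content'] (the real page was never posted), and the ids are the ones from the question's regexes.

```php
<?php
// Hypothetical stand-in for $result['content'] from the question.
$html = <<<HTML
<html><body>
  <span id="tests">first</span>
  <span id="tests2">second</span>
  <span id="tests3">third</span>
</body></html>
HTML;

// Collect libxml warnings internally instead of printing them;
// real-world HTML is rarely clean.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Result key => span id, mirroring the preg_match calls in the question.
$fields = ['test' => 'tests', 'test2' => 'tests2', 'test3' => 'tests3'];

$row = [];
foreach ($fields as $key => $id) {
    // string() yields the text of the first matching node,
    // or an empty string when nothing matches.
    $row[$key] = $xpath->evaluate('string(//span[@id="' . $id . '"])');
}

print_r($row);
```

Unlike the greedy `(.*)` patterns, this does not depend on where the closing tag happens to sit on the line.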

The following library also works but, according to the comments, is less preferable:
http://simplehtmldom.sourceforge.net/

An example that gets all a tags on a page:

<?php
  include_once('simple_html_dom.php');

  $url = "http://your_url/";
  $html = new simple_html_dom();
  $html->load_file($url);

  foreach($html->find("a") as $link)
  {
    // do something with the link
  }
?>

edited Aug 04 '16 at 21:31

answered Aug 04 '16 at 20:27

Tim van Osch


No need for an external library. – revo Aug 04 '16 at 21:26
Note that simple_html_dom isn't so simple and that its source code makes a massive use of regex. – Casimir et Hippolyte Aug 04 '16 at 21:26
... and it consumes your memory exponentially. – revo Aug 04 '16 at 21:29
Your answer shows a different approach and doesn't need to be removed; it's good to know too. – Casimir et Hippolyte Aug 04 '16 at 21:29
For the reference: +1 – Jan Aug 05 '16 at 06:39

score 2 · Answer 2 · answered Aug 04 '16 at 20:52

2

As others mentioned, use a parser instead (e.g. DOMDocument) and combine it with XPath queries. Consider the following example:

<?php

# set up some dummy data
$data = <<<DATA
<div>
    <a class='link'>Some link</a>
    <a class='link' id='otherid'>Some link 2</a>
</div>
DATA;

$dom = new DOMDocument();
$dom->loadHTML($data);

$xpath = new DOMXPath($dom);

# all links
$links = $xpath->query("//a[@class = 'link']");
print_r($links);

# special id link
$special = $xpath->query("//a[@id = 'otherid']");

# and so on
$textlinks = $xpath->query("//a[starts-with(text(), 'Some')]");
?>
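
print_r on a DOMNodeList does not show the matched text, so here is a small sketch of how the results of such queries might be read. It reuses the dummy data from the example; the variable names $texts, $specialText and $startsWithSome are only illustrative.

```php
<?php
// Same dummy data as in the example.
$data = <<<DATA
<div>
    <a class='link'>Some link</a>
    <a class='link' id='otherid'>Some link 2</a>
</div>
DATA;

$dom = new DOMDocument();
$dom->loadHTML($data);
$xpath = new DOMXPath($dom);

// A DOMNodeList can be iterated directly; textContent holds the node's text.
$texts = [];
foreach ($xpath->query("//a[@class = 'link']") as $node) {
    $texts[] = $node->textContent;
}

// For a single expected node, check length before calling item(0).
$special = $xpath->query("//a[@id = 'otherid']");
$specialText = $special->length > 0 ? $special->item(0)->textContent : null;

// starts-with() is the XPath 1.0 function name (with a hyphen).
$startsWithSome = $xpath->query("//a[starts-with(text(), 'Some')]")->length;

print_r($texts);
```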

answered Aug 04 '16 at 20:52

Jan


I've changed my code to use XPath instead of regex as you recommended, and it became much slower. – Lior Aug 05 '16 at 08:56
@Lior: You need to be more specific with the XPath queries then, i.e. `/div/span/p/a` instead of `//a`. I'd go for a more robust solution even if it is somewhat slower (1-2 secs). – Jan Aug 05 '16 at 09:19
The thing is that it runs inside a loop over every link I get, so each iteration makes it even slower. 0 loop 1.66981506348 1 loop 2.49688410759 2 loop 3.00950098038 3 loop 3.5253970623 4 loop 4.01076102257 5 loop 4.67162799835 6 loop 5.2378718853 7 loop 5.74008488655 8 loop 6.26041197777 9 loop 6.78747105598 10 loop 7.47332000732 11 loop 8.03243994713 12 loop 8.50538802147 13 loop 9.37472701073 14 loop 11.5049209595 15 loop 12.2112920284 ... 40 loop 30.2815680504 41 loop 31.1307020187 – Lior Aug 05 '16 at 11:36
@Lior: It probably does not need to run in a loop. Post your full code in the question. – Jan Aug 05 '16 at 11:48
Please show some html output, probably the queries can be combined or could be made relative. – Jan Aug 05 '16 at 17:40
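
A minimal sketch of what "relative" could mean here, assuming markup roughly shaped like the question's queries imply (the HTML below is invented, since the real output was never posted). Instead of walking from the document root twice inside concat(), the sub header node is located once and the rest is evaluated against it as the context node. Whether this changes the timings noticeably is untested; each iteration in the numbers above costs roughly half a second or more, so timing get_web_page separately from the parsing may be worthwhile.

```php
<?php
// Hypothetical markup; only the ids and class names come from the question.
$html = <<<HTML
<html><body>
  <span id="header"><span id="sub_header">Name</span> suffix</span>
  <ul>
    <li class="active first"><a>Type A</a></li>
    <li class="active"><a>Type B</a></li>
    <li class="other"><a>Ignored</a></li>
  </ul>
</body></html>
HTML;

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Locate the sub header once...
$sub = $xpath->query('//span[@id = "header"]/span[@id = "sub_header"]')->item(0);

$name = '';
if ($sub !== null) {
    // ...then evaluate relative to it: the second argument is the context node.
    $name = $xpath->evaluate('concat(text(), following-sibling::text()[1])', $sub);
}

// Selecting the <a> elements and reading textContent avoids a text() step.
$types = [];
foreach ($xpath->query('//li[starts-with(@class, "active")]/a') as $a) {
    $types[] = $a->textContent;
}

print_r(['name' => $name, 'types' => $types]);
```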
