
I'm retrieving a remote page with PHP, extracting a few links from it, and then fetching and parsing each linked page.
This takes about 12 seconds, which is far too long, so I need to optimize the code somehow.
My code looks something like this:

$result = get_web_page('THE_WEB_PAGE');

// The pattern has a single capture group, so the hrefs are in $matches[1].
preg_match_all('/<a data\-a=".*" href="(.*)">/', $result['content'], $matches);

$index = 0;
foreach ($matches[1] as $lnk) {
    $result = get_web_page($lnk);

    preg_match('/<span id="tests">(.*)<\/span>/', $result['content'], $match);

    $re[$index]['test'] = $match[1];

    preg_match('/<span id="tests2">(.*)<\/span>/', $result['content'], $match);

    $re[$index]['test2'] = $match[1];

    preg_match('/<span id="tests3">(.*)<\/span>/', $result['content'], $match);

    $re[$index]['test3'] = $match[1];
    ++$index;
}

I have some more preg_match calls inside the loop.
How can I optimize my code?
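
As a small illustration of how `preg_match_all` numbers its results, here is a minimal sketch on made-up markup (the sample HTML and hrefs are assumptions, not the real page). It also uses `[^"]*` rather than `.*`, so that two links on the same line are not swallowed into one match.

```php
<?php
// Hypothetical sample markup; the real page structure is not shown in the question.
$html = '<a data-a="x" href="/one">One</a> <a data-a="y" href="/two">Two</a>';

// $matches[0] holds the full matches, $matches[1] the first (and only) capture group.
// [^"]* stops at the closing quote instead of running to the last quote on the line.
preg_match_all('/<a data-a="[^"]*" href="([^"]*)">/', $html, $matches);

print_r($matches[1]);
```

With a greedy `.*` the same input would produce a single match spanning both links, which is one reason regex-based scraping tends to be fragile.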

Edit:

I've changed my code to use XPath instead of regex, and it became much slower.

Edit2:

Here is my full code:

<?php
$begin = microtime(TRUE);
$result = get_web_page('WEB_PAGE');

$dom = new DOMDocument();
$dom->loadHTML($result['content']);
$xpath = new DOMXPath($dom);

// Get the links
$matches = $xpath->evaluate('//li[@class = "lasts"]/a[@class = "lnk"]/@href | //li[@class=""]/a[ @class = "lnk"]/@href');
if ($matches === FALSE) {
    echo 'error';
    exit();
}
foreach ($matches as $match) {
    $links[] = 'WEB_PAGE'.$match->value;
}

$index = 0;

// For each link
foreach ($links as $link) {
    echo (string)($index).' loop '.(string)(microtime(TRUE)-$begin).'<br>';
    $result = get_web_page($link);

    $dom = new DOMDocument();
    $dom->loadHTML($result['content']);
    $xpath = new DOMXPath($dom);

    $match = $xpath->evaluate('concat(//span[@id = "header"]/span[@id = "sub_header"]/text(), //span[@id = "header"]/span[@id = "sub_header"]/following-sibling::text()[1])');
    if ($match === FALSE) {
        exit();
    }
    $data[$index]['name'] = $match;

    $matches = $xpath->evaluate('//li[starts-with(@class, "active")]/a/text()');
    if ($matches === FALSE) {
        exit();
    }
    foreach ($matches as $match) {
        $data[$index]['types'][] = $match->data;
    }

    $matches = $xpath->evaluate('//span[@title = "this is a title" and @class = "info"]/text()');
    if ($matches === FALSE) {
        exit();
    }
    foreach ($matches as $match) {
        $data[$index]['info'][] = $match->data;
    }

    $matches = $xpath->evaluate('//span[@title = "this is another title" and @class = "name"]/text()');
    if ($matches === FALSE) {
        exit();
    }
    foreach ($matches as $match) {
        $data[$index]['names'][] = $match->data;
    }

    ++$index;
}

?>
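
The `echo` at the top of the loop times fetching and parsing together, so it cannot show which of the two dominates. A minimal sketch of timing them separately follows. Since `get_web_page()` is not shown, `fetch_page_stub()` is a hypothetical stand-in that returns a fixed string; the URLs and markup are likewise assumptions, and with a real fetch function the numbers would of course differ.

```php
<?php
// Hypothetical stand-in for get_web_page(), which the question does not show.
// It returns a fixed string, so its timing here is only a placeholder.
function fetch_page_stub(string $url): array
{
    return ['content' => '<html><body><span id="header">'
        . '<span id="sub_header">Name</span> tail</span></body></html>'];
}

$timings = ['fetch' => 0.0, 'parse' => 0.0];
libxml_use_internal_errors(true); // collect libxml warnings instead of printing them

foreach (['page-1', 'page-2'] as $url) { // hypothetical URLs
    $t = microtime(true);
    $result = fetch_page_stub($url);
    $timings['fetch'] += microtime(true) - $t;

    $t = microtime(true);
    $dom = new DOMDocument();
    $dom->loadHTML($result['content']);
    $xpath = new DOMXPath($dom);
    $name = $xpath->evaluate('string(//span[@id="sub_header"])');
    $timings['parse'] += microtime(true) - $t;
}

printf("fetch: %.4fs, parse: %.4fs\n", $timings['fetch'], $timings['parse']);
```

If the fetch total turns out to account for most of the runtime, tuning the XPath expressions will not help much, and the requests themselves are the place to look.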
Lior

2 Answers


Consider using a DOM framework for PHP. This should be way faster.

Use PHP's DOMDocument with xpath queries:
http://php.net/manual/en/class.domdocument.php

See Jan's answer for more explanation.

The following also works, but according to the comments it is less preferable.
For example, Simple HTML DOM:
http://simplehtmldom.sourceforge.net/

An example that gets all a tags on a page:

<?php
  include_once('simple_html_dom.php');

  $url = "http://your_url/";
  $html = new simple_html_dom();
  $html->load_file($url);

  foreach($html->find("a") as $link)
  {
    // do something with the link
  }
?>
Tim van Osch

As others mentioned, use a parser instead (e.g. DOMDocument) and combine it with XPath queries. Consider the following example:

<?php

# set up some dummy data
$data = <<<DATA
<div>
    <a class='link'>Some link</a>
    <a class='link' id='otherid'>Some link 2</a>
</div>
DATA;

$dom = new DOMDocument();
$dom->loadHTML($data);

$xpath = new DOMXPath($dom);

# all links
$links = $xpath->query("//a[@class = 'link']");
print_r($links);

# special id link
$special = $xpath->query("//a[@id = 'otherid']");

# and so on
$textlinks = $xpath->query("//a[starts-with(text(), 'Some')]");
?>
Jan
  • I've changed my code to use XPath instead of regex as you recommended, and it became much slower. – Lior Aug 05 '16 at 08:56
  • @Lior: You need to be more specific with the XPath queries then, i.e. `/div/span/p/a` instead of `//a`. I'd go for a more robust solution even if it is somewhat slower (1-2 secs). – Jan Aug 05 '16 at 09:19
  • The thing is that it runs inside a loop, once for each link that I get, so each iteration makes it even slower: 0 loop 1.66981506348 1 loop 2.49688410759 2 loop 3.00950098038 3 loop 3.5253970623 4 loop 4.01076102257 5 loop 4.67162799835 6 loop 5.2378718853 7 loop 5.74008488655 8 loop 6.26041197777 9 loop 6.78747105598 10 loop 7.47332000732 11 loop 8.03243994713 12 loop 8.50538802147 13 loop 9.37472701073 14 loop 11.5049209595 15 loop 12.2112920284 ... 40 loop 30.2815680504 41 loop 31.1307020187 – Lior Aug 05 '16 at 11:36
  • @Lior: It probably does not need to run in a loop. Post your full code in the question. – Jan Aug 05 '16 at 11:48
  • Please show some HTML output; the queries can probably be combined or made relative. – Jan Aug 05 '16 at 17:40
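
To make the suggestions in these comments concrete, here is a rough sketch of a specific path and of relative queries. The markup is invented for illustration (the real pages were never posted), so the paths are assumptions that would need adapting. The second argument to `DOMXPath::query()` and `DOMXPath::evaluate()` is a context node, which lets an expression run relative to a node that was already found instead of searching the whole document again.

```php
<?php
// Hypothetical markup standing in for one of the linked pages.
$html = <<<HTML
<html><body>
  <span id="header"><span id="sub_header">Name</span> suffix</span>
  <ul>
    <li class="active first"><a>Type A</a></li>
    <li class="active"><a>Type B</a></li>
    <li class="other"><a>Ignored</a></li>
  </ul>
</body></html>
HTML;

libxml_use_internal_errors(true); // collect libxml warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Specific path from the root instead of a document-wide // search.
$header = $xpath->query('/html/body/span[@id="header"]')->item(0);

// Relative queries: evaluated against $header, not the whole document.
$name = $xpath->evaluate('string(span[@id="sub_header"])', $header);
$tail = $xpath->evaluate('string(span[@id="sub_header"]/following-sibling::text()[1])', $header);

$types = [];
foreach ($xpath->query('/html/body/ul/li[starts-with(@class, "active")]/a') as $a) {
    $types[] = $a->textContent;
}

echo $name . $tail, "\n";
print_r($types);
```

Whether this is measurably faster depends on the size of the real documents; it mainly avoids repeating the same header lookup inside one expression.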