Scraping HTML content, preg_match not working

Question

I'm currently scraping data from an HTML page, and one piece of my code is not working. The HTML contains something like this:

<ul class="pagination">
    <li>
        <span class="page active">
            1
        </span>
    </li>
    <li>
        <a class="page available" href="/somethingherewithanychars1">
            2
        </a>
    </li>
    <li>
        <a class="page available" href="/somethingherewithanychars2">
            3
        </a>
    </li>
    <li>
        <a class="page available" href="/somethingherewithanychars3">
        4
        </a>
    </li>
</ul>

I tried this code to get the href value of the link that follows the active page. In the example, the active page is page 1, so the href I want is the one for page 2, which is /somethingherewithanychars1. However, it is not working:

$file_string = file_get_contents($url);
preg_match('/<li><span class="page active">.*?<\/span><\/li><li><a class="page available" href="(.*)">/i', $file_string, $pages);

print_r($pages);

The HTML I am accessing also contains code like this:

<div class="attributes">
   <a class="name" href="/linksTothissite" data-hovercard-id="somechars">link1</a>
   <span class="list">
    USA
   </span>
   <a class="name" href="/linksTothissite" data-hovercard-id="somechars">link2</a>
   <span class="list">
    CANADA
   </span>
</div>

I tried getting the values with this code, and I do get link1 and link2:

preg_match_all('/<a class="name" href=".*?" data-hovercard-id=".*?">(.*)<\/a>/i', $file_string, $values);

With this one I also get USA and CANADA:

preg_match_all('/<span class="list">(.*?)<\/span>/s', $file_string, $values);
$val = $values[1];

Why is my preg_match not getting the value I need? I also tried preg_match_all(), but print_r still outputs an empty Array ( ), while the rest of my code works.

Regex is only suitable to such tasks if you know what you're doing. Seeing that your preg_match doesn't even account for the whitespace between `
I see. I tried changing my code to ignore whitespace and newlines, but it still does not work. The input for my other code also contains newlines, yet I can get the values from it. — Snippet, Sep 01 '13 at 16:36
@Cobra_Fast Please don't post links to that question, because they are not helpful to the reader, unless you follow it up with something that is an answer they can use. *You* know the point of the comment and that wall of text is that parsing HTML with regexes is a bad idea. However, to someone else who is asking, that is not at all clear. Worse, it doesn't point the reader to any useful solutions that *can* help parse HTML reliably. — Andy Lester, Sep 01 '13 at 21:02
@AndyLester The post I've linked ends with "Have you tried using an XML parser instead?"... — Cobra_Fast, Sep 01 '13 at 22:05
@Cobra_Fast: I suspect that that gets lost amidst the mass of unreadable text. If you're going to suggest an XML parser, why not simply make the comment "Have you tried using an XML parser instead?" which is far more likely to actually be read by the OP. — Andy Lester, Sep 02 '13 at 15:48
@AndyLester Because posting that answer is way more entertaining. I'd prefer to not discuss the controversy of the answer I've linked here since it already spawned extensive meta threads with good arguments for all different viewpoints... — Cobra_Fast, Sep 02 '13 at 15:54
@Cobra_Fast: I prefer to go with helpful over entertaining to the commenters. I'm sure that those we are here to help do, too. — Andy Lester, Sep 03 '13 at 03:31

Casimir et Hippolyte · Accepted Answer · 2013-09-01T18:34:07.900

A good way to do that is to use the DOM combined with XPath, as Prix wrote.

If you want to check that the link you are looking for is a child element of an item from an unordered list with the class "pagination", and to check that the item is the next after the "active page" item, the query will be a little complicated.

$doc = new DOMDocument();
@$doc->loadHTMLFile($url);
$xpath = new DOMXPath($doc);
$xquery = '//ul[@class="pagination"]'                    // ul with the "pagination" class
        . '/li[descendant::span[@class="page active"]]'  // li that contains a span with "page active" class
        . '/following-sibling::*[1]'                     // next sibling (next li)
        . '/a/@href';                                    // href attribute of the a tags
$links = $xpath->query($xquery);
echo $links->item(0)->value;
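
For trying the query without fetching a remote page, here is a minimal sketch that runs the same XPath expression against the question's sample markup held in a string. The helper name nextPageHref() is hypothetical, and the use of loadHTML() with libxml_use_internal_errors() (instead of loadHTMLFile() with @) is an assumption made only to keep the example self-contained. It also returns null instead of failing when nothing matches.

```php
<?php
// Sample markup taken from the question (shortened, with the list closed).
$html = <<<'HTML'
<ul class="pagination">
    <li>
        <span class="page active">
            1
        </span>
    </li>
    <li>
        <a class="page available" href="/somethingherewithanychars1">
            2
        </a>
    </li>
    <li>
        <a class="page available" href="/somethingherewithanychars2">
            3
        </a>
    </li>
</ul>
HTML;

// Hypothetical helper: returns the href that follows the active page, or null.
function nextPageHref(string $html): ?string
{
    // Collect libxml parse warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = $xpath->query(
        '//ul[@class="pagination"]'
        . '/li[descendant::span[@class="page active"]]'
        . '/following-sibling::*[1]'
        . '/a/@href'
    );

    // query() returns false for an invalid expression and an empty list when nothing matches.
    if ($links === false || $links->length === 0) {
        return null;
    }
    return $links->item(0)->nodeValue;
}

$next = nextPageHref($html);
echo $next ?? 'no next page', PHP_EOL; // prints /somethingherewithanychars1
```

Returning null makes the "active page is the last page" case easy to handle in the caller, since there is then no following li with a link.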

The reasons why your regex doesn't work are:

You have not accounted for the whitespace (spaces, tabs, newlines) that can appear between tags.
You use the dot to describe the characters between tags, but by default the dot can't match newlines (it only does so with the s modifier).
Not fatal here, but you use a greedy quantifier, (.*)", to describe the link. As a result, the regex engine takes the last double quote on the line, not the first one it meets.
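
Each of these three points can be checked in isolation. The following is only a sketch: the small strings are made-up fragments shaped like the question's markup, not the real page.

```php
<?php
// 1. Literal "</span></li>" does not match when whitespace separates the tags.
$tags = "</span>\n    </li>";
$strict  = preg_match('/<\/span><\/li>/', $tags);     // no match
$lenient = preg_match('/<\/span>\s*<\/li>/', $tags);  // match

// 2. Without the s modifier, the dot stops at a newline.
$span = "<span class=\"page active\">\n    1\n</span>";
$withoutS = preg_match('/<span class="page active">.*?<\/span>/', $span);   // no match
$withS    = preg_match('/<span class="page active">.*?<\/span>/s', $span);  // match

// 3. A greedy (.*) runs to the last "> on the line; a negated class stops at the first quote.
$line = '<a class="page available" href="/page2"><b title="x">2</b></a>';
preg_match('/href="(.*)">/', $line, $greedy);
preg_match('/href="([^"]+)">/', $line, $negated);

var_dump($strict, $lenient);   // int(0) int(1)
var_dump($withoutS, $withS);   // int(0) int(1)
echo $greedy[1], PHP_EOL;      // /page2"><b title="x
echo $negated[1], PHP_EOL;     // /page2
```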

After adding \s* where necessary, you can replace .* and .*? with negated character classes:

preg_match('/<li>\s*<span class="page active">[^<]+<\/span>\s*<\/li>\s*<li>\s*<a class="page available" href="([^"]+)">/i', $file_string, $pages);

Keep in mind that the smallest change in your HTML can make your pattern fail, whereas the DOM method will keep working (as long as the tree structure remains the same).
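
To illustrate that difference, here is a rough sketch with a hypothetical variant of the question's markup in which the attributes of the a tag appear in a different order. The pattern and the XPath query are the ones from this answer. Loading the markup from a string with loadHTML() is an assumption made for the demo.

```php
<?php
// Hypothetical variant of the markup: href now comes before class on the <a> tag.
$html = <<<'HTML'
<ul class="pagination">
    <li><span class="page active">1</span></li>
    <li><a href="/somethingherewithanychars1" class="page available">2</a></li>
</ul>
HTML;

// The pattern expects class="..." to come directly before href="...".
$pattern = '/<li>\s*<span class="page active">[^<]+<\/span>\s*<\/li>\s*<li>\s*'
         . '<a class="page available" href="([^"]+)">/i';
$regexFound = preg_match($pattern, $html, $pages);

// The XPath query depends on the tree structure, not on attribute order.
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);
$links = $xpath->query(
    '//ul[@class="pagination"]/li[descendant::span[@class="page active"]]'
    . '/following-sibling::*[1]/a/@href'
);
$domHref = ($links !== false && $links->length > 0) ? $links->item(0)->nodeValue : null;

echo $regexFound, PHP_EOL; // 0: the pattern no longer matches
echo $domHref, PHP_EOL;    // /somethingherewithanychars1
```

Note that the query still compares the class attributes as exact strings, so a change such as an added class name on the ul or span would break it as well.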

Thanks for the brief explanation. I had been trying the DOM but couldn't figure out how to write the query for it. I've tried both the DOM code and the preg_match you gave, and both work. I'm interested in the DOM approach. Can I ask for some links or documentation about DOM and XPath queries? :) — Snippet, Sep 01 '13 at 19:49
You can find everything about DOMDocument and DOMXPath in the PHP manual: http://www.php.net/manual/en/class.domxpath.php , http://www.php.net/manual/en/class.domdocument.php You can also find tutorials on using XPath with PHP, such as http://www.ibm.com/developerworks/library/x-xpathphp/ , or with your favorite search engine. — Casimir et Hippolyte, Sep 01 '13 at 20:29
