Using regex to extract href descriptions based on criteria

Question

Possible Duplicate:
How to parse and process HTML with PHP?

I need to parse blocks of HTML, replacing some hrefs with the link description based on whether the description meets certain criteria.

The regex I'm using to identify specific strings is used elsewhere in my application:

$regex  = "/\b[FfGg][\.][\s][0-9]{1,4}\b/";
preg_match_all($regex, $html, $matches, PREG_SET_ORDER);

I'm using the following SO question as a starting point for extracting href descriptions:

Replacing html link tags with a text description

The idea is to convert any link having a "FfGg.xxxx" type identifier, and leave the rest intact (i.e., the Google link).

What I have so far is:

    $html = 'Ten reports <a href="http://google.com">Google!</a> on 14 mice with ABCD 
show that low plasma BCAA, particularly ABC and to a lesser extent DEF, can result in 
severe but reversible epithelial damage to the skin, eye and gastrointestinal tract.
</li><li>Symptoms were reported in conjunction with low plasma ABC levels in 9 case 
reports. In two case reports, ABC levels were between 1.9 and 48 µmol/L (<a 
href="/docpage.php?obscure==100" target="F.100">F.100</a>, <a 
href="/docpage.php?obscure==68" target="F.68">F.68</a>, <a href="/docpage.php?obscure==67" 
target="F.67">F.67</a>, <a href="/docpage.php?obscure==71" target="F.71">F.71</a>, <a 
href="/docpage.php?obscure==122" target="F.122">F.122</a>, <a 
href="/docpage.php?obscure==92" target="F.92">F.92</a>, <a href="/docpage.php?obscure==96" 
target="F.96">F.96</a>);';

This converts all links, including google:

$html = preg_replace("/<a.*?href=\"(.*?)\".*?>(.*?)<\/a>/i", "$2", $html);

This returns a blank HTML string:

$html = preg_replace("/<a.*?href=\"(.*?)\".*?>[FfGg][\.][\s][0-9]{1,4}<\/a>/i", "$2", $html);

I believe the problem is in how I'm embedding this regex in the second (non-working) example above:

[FfGg][\.][\s][0-9]{1,4}

What is the correct way of embedding the FfGg expression in HTML found in my preg_replace example above?

I am not the downvoter, but the string `FfGg ` appears nowhere in your sample data. *edit* misread the regex, nevermind — DaveRandom, Sep 21 '12 at 14:30
...but, use DOM for this. Regex is not the tool for this job. — DaveRandom, Sep 21 '12 at 14:32
I've just noticed that your second `preg_replace()` regex doesn't have a second capture group in it. You forgot to put parentheses around the link content. — DaveRandom, Sep 21 '12 at 14:39
The downvote is probably due to the fact that you are processing HTML with a regex, which is widely regarded as bad practice. In my opinion, though, it isn't strictly fair to expect you to know that. Upvoted. — dan1111, Sep 21 '12 at 14:44

score 2 · Answer 1 · edited May 23 '17 at 11:49

You shouldn't be parsing HTML with a regex. You simply can't handle all of the cases correctly. Here are just a few examples of valid HTML that would break your link-finding regex:

<!-- <a href="www.blah.com">   -->    <a href="www.foo.com">F.100</a>
<area>...</area>  ...  <a href="www.foo.com">F.100</a>
<a href="www.foo.com">F.100</a >

I suggest taking a look at this question for better approaches: How do you parse and process HTML/XML in PHP?
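
To make the three cases above concrete, here is a minimal sketch that runs the asker's "convert every link" pattern from the question over each of them. The array keys (`comment`, `area`, `space`) and the `$results` variable are names made up for this illustration.

```php
<?php
// The asker's "convert every link" pattern, copied from the question.
$pattern = '/<a.*?href="(.*?)".*?>(.*?)<\/a>/i';

// The three inputs from this answer; the keys are illustrative labels only.
$cases = [
    'comment' => '<!-- <a href="www.blah.com">   -->    <a href="www.foo.com">F.100</a>',
    'area'    => '<area>...</area>  ...  <a href="www.foo.com">F.100</a>',
    'space'   => '<a href="www.foo.com">F.100</a >',
];

$results = [];
foreach ($cases as $name => $input) {
    // Replace each matched link with its second capture group (the link text).
    $results[$name] = preg_replace($pattern, '$2', $input);
    echo $name, ': ', $results[$name], PHP_EOL;
}
```

In the first case the match starts at the commented-out link, so the closing `</a>` of the real link is consumed and its opening tag is left dangling. In the second, `<a.*?` happily matches `<area`, so everything before the real link disappears. In the third, `</a >` is never matched, so the link is left untouched.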

score 2 · Accepted Answer · answered Sep 21 '12 at 15:09

Here is the DOM (correct) way to do it:

EDIT: Improved regex

<?php

    $html = 'Ten reports <a href="http://google.com">Google!</a> on 14 mice with ABCD show that low plasma BCAA, particularly ABC and to a lesser extent DEF, can result in severe but reversible epithelial damage to the skin, eye and gastrointestinal tract.</li><li>Symptoms were reported in conjunction with low plasma ABC levels in 9 case reports. In two case reports, ABC levels were between 1.9 and 48 µmol/L (<a href="/docpage.php?obscure==100" target="F.100">F.100</a>, <a href="/docpage.php?obscure==68" target="F.68">F.68</a>, <a href="/docpage.php?obscure==67" target="F.67">F.67</a>, <a href="/docpage.php?obscure==71" target="F.71">F.71</a>, <a href="/docpage.php?obscure==122" target="F.122">F.122</a>, <a href="/docpage.php?obscure==92" target="F.92">F.92</a>, <a href="/docpage.php?obscure==96" target="F.96">F.96</a>);';

    // Create a new DOMDocument and load the HTML string
    $dom = new DOMDocument('1.0');
    $dom->loadHTML($html);

    // Create an XPath object for this DOMDocument
    $xpath = new DOMXPath($dom);

    // Loop over all <a> elements in the document
    // Ideally we would combine the regex into the XPath query, but XPath 1.0
    // has no regex-matching function
    foreach ($xpath->query('//a') as $anchor) {
        // See if the link matches the pattern
        if (preg_match('/^\s*[gf]\s*\.\s*\d{1,4}\s*$/i', $anchor->nodeValue)) {
            // If it does, convert it to a text node (effectively, un-linkify it)
            $textNode = new DOMText($anchor->nodeValue);
            $anchor->parentNode->replaceChild($dom->importNode($textNode), $anchor);
        }
    }

    // Because you are working with a partial HTML string, I extract just that
    // string. If you are actually working with a full document, you can
    // replace all the code below this comment with simply:
    // $result = $dom->saveHTML();

    // A string to hold the result
    $result = '';

    // Iterate all elements that are a direct child of the <body> and convert
    // them to strings
    foreach ($xpath->query('/html/body/*') as $node) {
        $result .= $node->C14N();
    }

    // $result now contains the modified HTML string

See it working (NB: the error message you see is because the HTML string you supplied is not valid)
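
For reuse, the same approach could be wrapped in a function. The following is only a sketch under a few assumptions: the function name `unlinkReferences` and the shortened `$sample` string are hypothetical, the DOM extension is assumed to be available, and it departs from the code above in three ways. It buffers libxml's parse errors with `libxml_use_internal_errors()` rather than letting them surface as warnings, it selects `node()` instead of `*` so that text nodes sitting directly under `<body>` are not dropped, and it serializes with `saveHTML($node)` instead of `C14N()`.

```php
<?php
// Hypothetical helper, not from the answer: un-links anchors whose text
// looks like "F.100" or "G.12" and returns the serialized <body> contents.
function unlinkReferences(string $html): string
{
    // Buffer libxml parse errors instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument('1.0');
    $dom->loadHTML($html);

    // Discard the buffered errors and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//a') as $anchor) {
        // Same pattern as in the answer: the link text must be the identifier.
        if (preg_match('/^\s*[gf]\s*\.\s*\d{1,4}\s*$/i', $anchor->nodeValue)) {
            $text = $dom->createTextNode($anchor->nodeValue);
            $anchor->parentNode->replaceChild($text, $anchor);
        }
    }

    // node() also selects text nodes that sit directly under <body>.
    $result = '';
    foreach ($xpath->query('/html/body/node()') as $node) {
        $result .= $dom->saveHTML($node);
    }
    return $result;
}

// Shortened, ASCII-only stand-in for the HTML in the question.
$sample = '<p>See <a href="http://google.com">Google!</a> and '
        . '<a href="/docpage.php?obscure==100" target="F.100">F.100</a>.</p>';
echo unlinkReferences($sample), PHP_EOL;
```

The Google link should survive while the `F.100` link is reduced to plain text. Character encoding is deliberately left out of this sketch; input containing non-ASCII characters such as `µ` may need an explicit encoding hint before `loadHTML()` is called.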

+1 - A regex is a great tool for dealing with HTML node text/attributes, *once you have parsed the HTML structure*. — Tim M., Sep 21 '12 at 15:20
Excellent example, and thanks to you and @TimMedora for helping clarify how best to use regex & DOM. — a coder, Sep 21 '12 at 17:53

score 1 · Answer 3 · answered Sep 21 '12 at 16:56

You shouldn't rely on reluctant quantifiers so much. They try to consume as few characters as possible, but they'll consume as many as they have to in order to achieve an overall match. If the HTML is minified (specifically, if it has very few or no newlines), each of those .*?'s may end up trying to consume the entire rest of the document, and they may have to do it many times.

That's particularly true when no match is possible; it has to travel every possible path through the text before it admits defeat. Another problem is that reluctant quantifiers won't prevent a match that starts too early. Given this string:

<a href="www.blah.com">...</a> <a href="www.foo.com">F.100</a>

...it will start matching at the first <a> tag, and stop at the end of the second one. In this regex:

'~<a\b[^>]*\bhref="[^"]*"[^>]*>([FG]\.\d{1,4})</a>~i'

...I've replaced every .*? with [^>]* or [^"]* to confine those parts of the match to a single tag or attribute value respectively. Although this regex works much better, be aware that it's not foolproof--far from it. But it's about as close as you can reasonably get when matching HTML with regexes.
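
As a rough illustration of the difference, the sketch below runs both styles of pattern over a shortened version of the asker's data. The `$lazy` pattern is my reconstruction of the asker's second regex with the missing capture group added and the `[\s]` dropped (the sample identifiers contain no whitespace); it is an assumption, not code from the question. The variable names are made up for the example.

```php
<?php
// Shortened stand-in for the HTML in the question.
$html = 'See <a href="http://google.com">Google!</a> and '
      . '<a href="/docpage.php?obscure==100" target="F.100">F.100</a>.';

// Assumed reconstruction of the asker's pattern, with a second capture group.
$lazy = '~<a.*?href="(.*?)".*?>([FG]\.\d{1,4})</a>~i';

// The pattern from this answer, confined to a single tag and attribute value.
$tight = '~<a\b[^>]*\bhref="[^"]*"[^>]*>([FG]\.\d{1,4})</a>~i';

$lazyResult  = preg_replace($lazy, '$2', $html);
$tightResult = preg_replace($tight, '$1', $html);

echo $lazyResult, PHP_EOL;   // See F.100.
echo $tightResult, PHP_EOL;  // See <a href="http://google.com">Google!</a> and F.100.
```

The lazy version starts at the Google link and stretches until it finds an identifier, swallowing the Google link and the text in between. The tightened version cannot cross a `>`, so it fails at the Google link and matches only the `F.100` link.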

Thanks -- this is useful to know, as well. — a coder, Sep 21 '12 at 17:56
