Overly greedy regexp backreference with php preg_replace and a not greedy enough expression

Question

I've googled, regexinfo.com'ed and experimented for hours, and cannot for the life of me figure out what's wrong with these two regular expressions, which are supposed to match meta tags. Any help is greatly appreciated. :)

Purp 1: Captures the "> at the end of lines when $1 is used in preg_replace.

'/<meta[\s]+[^>]*?name[\s]?=[\s"\']+keywords[\s"\']+content[\s]?=[\s"\']+([^"\']*)/ixU'

Purp 2: Doesn't capture lines, more or less on a whim. (never mind lack of support for ')

'/<meta(?=[^>]*name="keywords")\s[^>$]*content="([^"]*)[">]*$/ixU'

Please also add the subject string that the patterns are matched against. It should also help show that it's much easier to use an HTML parser to achieve what you're trying to do with regular expressions. — hakre, Dec 28 '11 at 00:46
+1 @hakre - and it wouldn't be an HTML/regex question on SO if somebody didn't link to [this](http://stackoverflow.com/questions/1732348#answer-1732454) so I guess I'll take the hit this time :-D — DaveRandom, Dec 28 '11 at 00:51
lol, agreed, it's a pain. Nevertheless, I'd like to get the *** working. I might give get_meta_data a whirl if I get eternally stuck. — Lars-Erik, Dec 28 '11 at 00:58
Never use the `'U'` modifier! It's _never_ needed and its only purpose is to confuse. Instead simply add a `?` ungreedy modifier to those quantifiers that need it. (And this problem does not need any lazy quantifiers anyway.) — ridgerunner, Dec 28 '11 at 02:38

score 0 · Answer 1 · answered Dec 28 '11 at 01:20

I've spotted that you're using three PCRE modifiers (see the PHP docs):

i (PCRE_CASELESS) - looks good, as tag and attribute names are not case-sensitive in HTML.
x (PCRE_EXTENDED) - by the looks of it, you don't need this with your pattern.
U (PCRE_UNGREEDY) - not sure you actually need this either; it's probably easier to go with the default and control each repetition on its own, i.e. change the default only where needed with a specific quantifier.

One you're probably missing is the m (PCRE_MULTILINE) modifier to make $ actually match the end of a line. Unless it is used, $ matches only at the end of the subject string (or just before a trailing newline at the very end).
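To illustrate the point about $ and the m modifier, here is a minimal sketch. The subject string and the shortened pattern are made up for the demonstration and are not the asker's actual data.

```php
<?php
// Hypothetical two-line subject; the meta tag ends the first line.
$subject = "<meta name=\"keywords\" content=\"php, regex\">\n<title>Demo</title>";

// Without /m, $ matches only at the very end of the subject
// (or just before a final trailing newline), so this finds nothing.
$withoutM = preg_match('/content="([^"]*)">$/i', $subject, $a);

// With /m, $ also matches just before every newline.
$withM = preg_match('/content="([^"]*)">$/im', $subject, $b);

var_dump($withoutM); // int(0)
var_dump($withM);    // int(1)
echo $b[1], "\n";    // php, regex
```

If the input is read one line at a time, the m modifier makes no difference, because each subject then ends where its line ends.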

A good site explaining regular expressions is http://www.regular-expressions.info/, I sometimes look there if I need to find stuff quickly, because the other good reference for PCRE is all in one text file.

For your case probably this page is interesting about what is greedy and how to deal with it.

Alan Moore · Answer 2 · 2011-12-28T02:12:49.853

Leaving out the optional whitespace and assuming only double-quotes around the attribute values, your first regex is equivalent to this:

'/<meta\s+name="keywords"\s+content="([^"]*?)/i'

If the attributes happen to be listed in that order, this should match everything up to the opening quote of the content attribute. Inside the capturing group, [^"]* is supposed to consume the attribute value, but because you used the U (ungreedy) flag, it initially consumes nothing, as if it were [^"]*?. And that's the end of the regex, so it reports a successful match.

In other words, your immediate problem is that you left out the closing quote. If you want to match the whole tag, you need to add the closing > as well:

'/<meta\s+name="keywords"\s+content="([^"]*)">/i'

But as I said, that only works if there are only the two attributes and they're listed in that order, and it doesn't account for single-quoted or unquoted attribute values, or optional whitespace.
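The effect of the missing closing quote can be reproduced with a small sketch. The subject line here is a made-up example, and the first pattern is the simplified form of the original regex with the U flag put back.

```php
<?php
// Hypothetical subject line for illustration.
$subject = '<meta name="keywords" content="php, regex">';

// Simplified original: U flag, nothing after the capturing group.
// Under U, [^"]* is lazy, so it is satisfied with zero characters.
$broken = '/<meta\s+name="keywords"\s+content="([^"]*)/iU';
preg_match($broken, $subject, $lazy);

// With $1 as the replacement, only the matched prefix is replaced,
// so the value and the trailing "> are left behind untouched.
$leftover = preg_replace($broken, '$1', $subject);

// Closing quote and > added, no U flag: the group consumes the whole value.
preg_match('/<meta\s+name="keywords"\s+content="([^"]*)">/i', $subject, $fixed);

var_dump($lazy[1]);  // string(0) ""
var_dump($leftover); // string(12) "php, regex">"
var_dump($fixed[1]); // string(10) "php, regex"
```

This would also explain the stray "> described in the question: it is never part of the match, so preg_replace leaves it in place.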

Your second regex deals with the ordering problem by using a lookahead to match the name attribute. But it assumes the tag is followed immediately by a line break, which is not something you can count on. You should use the closing > to mark the end of the match:

'/<meta\s+(?=[^>]*name="keywords")[^>]*content="([^"]*)"[^>]*>/i'

And if you want to allow optional whitespace:

'/<meta\s+(?=[^>]*name\s*=\s*"keywords")[^>]*content\s*=\s*"([^"]*)"[^>]*>/i'

I would emphasize that your problem is not one of excess greediness. This regex works without the U flag and with nothing but normal, greedy quantifiers.
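As a quick check of the last pattern, the following sketch runs it against a few made-up tags (the subjects are assumptions, chosen to cover both attribute orders and a non-matching name).

```php
<?php
// Pattern from above: look ahead for name="keywords", then capture content.
$re = '/<meta\s+(?=[^>]*name\s*=\s*"keywords")[^>]*content\s*=\s*"([^"]*)"[^>]*>/i';

// Hypothetical subjects: both attribute orders, plus a tag that should not match.
$subjects = [
    '<meta name="keywords" content="php, regex">',
    '<META content = "php, regex" name = "keywords" />',
    '<meta name="description" content="not this one">',
];

$results = [];
foreach ($subjects as $s) {
    // Store the captured content value, or null when the tag does not match.
    $results[] = preg_match($re, $s, $m) ? $m[1] : null;
}

var_dump($results); // "php, regex", "php, regex", NULL
```

Every quantifier here is greedy. The lookahead only verifies that name="keywords" occurs somewhere before the closing >, which is why the attribute order no longer matters.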

When using '/ at the end of the tags in the $1 backreference. — Lars-Erik, Dec 28 '11 at 11:02
You're still leaving out the closing quote. The first regex above is just a simplified version of your own regex including the error; the second regex fixes the error. You seem to be using the first one. — Alan Moore, Dec 28 '11 at 16:15

ridgerunner · Accepted Answer · 2011-12-28T15:25:08.653

This tested function should do a pretty good job:

// Fetch keywords from META element.
function getKeywords($text) {
    $re = '/# Match META tag having name=keywords values.
        <meta                 # Start of META tag.
        [^>]*?                # Lazily match up to NAME attrib.
        \bname\s*=\s*         # NAME attribute name.
        ["\']?keywords[\'"]?  # NAME attribute value.
        [^>]*?                # Lazily match up to CONTENT attrib.
        \bcontent\s*=\s*      # CONTENT attribute name.
        (?|                   # Branch reset group for keywords value.
          "([^"]*)"           # Either $1.1: a double quoted,
        | \'([^\']*)\'        # or  $1.2: single quoted value
        )                     # End branch reset group.
        [^>]*                 # Greedily match up to end of tag.
        >                     # Literal end of META tag.
        /ix';
    if (preg_match($re, $text, $matches)) {
        return $matches[1];
    } else {
        return 'No META tag with keywords.';
    }
}

Note that the lazy modifiers are not necessary but will make it match just a smidge faster.
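The branch reset group (?|...) is what lets the function always return $matches[1]. The sketch below contrasts it with an ordinary non-capturing group, using a shortened pattern and a made-up single-quoted tag.

```php
<?php
// Branch reset: both alternatives store their capture in group 1.
$branchReset = '/content\s*=\s*(?|"([^"]*)"|\'([^\']*)\')/i';
// Plain non-capturing group: the single-quoted value lands in group 2.
$plainGroup  = '/content\s*=\s*(?:"([^"]*)"|\'([^\']*)\')/i';

$subject = "<meta name='keywords' content='php, regex'>"; // hypothetical input

preg_match($branchReset, $subject, $a);
preg_match($plainGroup, $subject, $b);

var_dump($a[1]); // string(10) "php, regex"
var_dump($b[1]); // string(0) ""  (the double-quoted branch did not take part)
var_dump($b[2]); // string(10) "php, regex"
```

Without the branch reset, the calling code would have to check both groups to find the value.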

Additional 2011-12-28: The OP has clarified the question, indicating that only one line of text is available, and the META tag's CONTENT attribute value may thus be truncated. Here is a different regex that captures into capture group 1 the CONTENT attribute value (which may be truncated) and matches the rest of the tag if it's all on one line:

// Fetch keywords CONTENT attrib value from META element.
function getKeywords($text) {
    $re = '/# Match META tag having name=keywords values.
        <meta                 # Start of META tag.
        [^>]*?                # Lazily match up to NAME attrib.
        \bname\s*=\s*         # NAME attribute name.
        ["\']?keywords[\'"]?  # NAME attribute value.
        [^>]*?                # Lazily match up to CONTENT attrib.
        \bcontent\s*=\s*      # CONTENT attribute name.
        (?|                   # Branch reset group for keywords value.
          "([^"\r\n]*)"?      # Either $1.1: a double quoted,
        | \'([^\'\r\n]*)\'?   # or  $1.2: single quoted value
        )                     # End branch reset group.
        (?:                   # Grab remainder of tag (optional).
          [^>\r\n]*           # Greedily match up to end of tag.
          >                   # Literal end of META tag.
        )?                    # Grab remainder of tag (optional).
        /ix';
    if (preg_match($re, $text, $matches)) {
        return $matches[1];
    } else {
        return 'No META tag with keywords.';
    }
}
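A short sketch of how the single-line variant behaves. The pattern below is the same regex written on one line without the x-mode comments, and both input lines are made-up examples.

```php
<?php
// One-line form of the pattern above (same pieces, comments removed).
$re = '/<meta[^>]*?\bname\s*=\s*["\']?keywords[\'"]?[^>]*?\bcontent\s*=\s*'
    . '(?|"([^"\r\n]*)"?|\'([^\'\r\n]*)\'?)(?:[^>\r\n]*>)?/i';

// Hypothetical lines: a complete tag, and one cut off in the middle of the value.
$complete  = '<meta name="keywords" content="php, regex">';
$truncated = '<meta name="keywords" content="php, regex, pcre';

preg_match($re, $complete, $m1);
preg_match($re, $truncated, $m2);

var_dump($m1[1]); // string(10) "php, regex"
var_dump($m2[1]); // string(16) "php, regex, pcre"
```

Because the closing quote and the remainder of the tag are both optional, a truncated line still yields whatever part of the value is present on it.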

This works more or less satisfactory, except that I actually hoped to be able to capture the rest of the line if the content attribute value spans multiple lines. (just end it) I tried /]*?\bname\s*=\s*["\']?description[\'"]?[^>]*?\bcontent\s*=\s*(?|"([^"]*)"|\'([^\']*)\')[>|$]/im with no apparent further success. — Lars-Erik, Dec 28 '11 at 11:03
This expression _does_ match the whole META tag even when the value of the CONTENT attribute (or other attributes) spans multiple lines. If you can be more clear about what you are looking for (i.e. which attributes you want to capture) this is easily fixed! Please edit your question to give an example of a multi-line META tag and describe which parts of it you want to capture. — ridgerunner, Dec 28 '11 at 14:27
I only read single lines. Want to capture from content value start to ",',> or eol. — Lars-Erik, Dec 28 '11 at 14:37
