10

I have a string that may look something like this:

$r = 'Filed under: <a>Group1</a>, <a>Group2</a>';

Here is the regular expression I am using so far:

preg_match_all("/Filed under: (?:<a.*?>([\w|\d|\s]+?)<\/a>)+?/", $r, $matches);

I want the part of the regular expression inside the () to keep making matches, as designated by the +? at the end. But it just won't do it. ::sigh::

Any ideas? I know there has to be a way to do this in one regular expression instead of breaking it up.

Alan Moore
Senica Gonzalez

4 Answers

12

Just for fun here's a regex that will work with a single preg_match_all:

'%(?:Filed under:\s*+|\G</a>)[^<>]*+<a[^<>]*+>\K[^<>]+%'

Or, in a more readable format:

'%(?:
      Filed[ ]under: # your sentinel string (space in a class, since /x ignores bare whitespace)
    |                
      \G             # NEXT MATCH POSITION
      </a>           # an end tag
  )
  [^<>]*+          # some non-tag stuff     
  <a[^<>]*+>       # an opening tag
  \K               # RESET MATCH START
  [^<>]+           # the tag's contents
%x'

\G matches the position where the next match attempt would start, which is usually the spot where the previous successful match ended (but if the previous match was zero-length, it bumps ahead one more). That means the regex won't match a substring starting with </a> until after it has matched one starting with Filed under: at least once.

After the sentinel string or an end tag has been matched, [^<>]*+<a[^<>]*+> consumes everything up to and including the next start tag. Then \K spoofs the start position so the match (if there is one) appears to start after the <a> tag (it's like a positive lookbehind, but more flexible). Finally, [^<>]+ matches the tag's contents and brings the match position up to the end tag so \G can match.
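As a rough sketch of how this behaves in practice, the snippet below runs the one-line pattern (using the `+` from the commented version) through a single preg_match_all call. The sample inputs and variable names are made up for illustration. The second input shows the edge case where the subject itself starts with </a>: \G also matches at offset 0, so a link can be picked up without the sentinel.

```php
<?php
// One-line pattern from above; \G and \K come from the PCRE library.
$pattern = '%(?:Filed under:\s*+|\G</a>)[^<>]*+<a[^<>]*+>\K[^<>]+%';

// Illustrative input: the link before the sentinel should be skipped.
$r = '<a>Home</a> Filed under: <a>Group1</a>, <a>Group2</a>';
preg_match_all($pattern, $r, $matches);
// Thanks to \K, the whole-match array already holds just the tag contents.
$groups = $matches[0];    // ['Group1', 'Group2']

// Edge case: \G also matches at the very start of the subject,
// so a string that begins with </a> matches without the sentinel.
$edge = '</a>, <a>Stray</a>';
preg_match_all($pattern, $edge, $edgeMatches);
$stray = $edgeMatches[0]; // ['Stray']

print_r($groups);
print_r($stray);
```

Note that the results are in index 0 of the match array, not index 1, because the pattern has no capturing group.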

But, as I said, this is just for fun. If you don't have to do the job in one regex, you're better off with a multi-step approach like the one @codaddict used; it's more readable, more flexible, and more maintainable.

\K reference
\G reference

EDIT: Although the references I gave are for the Perl docs, these features are supported by PHP, too--or, more accurately, by the PCRE lib. I think the Perl docs are a little better, but you can also read about this stuff in the PCRE manual.

Alan Moore
  • I didn't know about `\K`. Interesting! A small note about `\G` - you refer to the "previous match", which is OK, and to the "next match", which is a little confusing (especially when the Perl example you've linked to is downright misleading - it *sets* the next position in code - **which is very different from the default behavior**). Simply put - `\G` refers to the position the current match was attempted to start in. It is also not accurate that `</a>` will only match after `Filed under:` - it can also match on the start of the string, for example `, Group2`: http://ideone.com/aTjrm . – Kobi Aug 21 '11 at 04:22
  • (by the way, I came from here: http://stackoverflow.com/questions/5982451/regex-capturing-a-repeated-group/7135730#7135730) – Kobi Aug 21 '11 at 04:27
  • @Kobi: I should have left out the part about zero-length matches; too much noise, not enough signal. Usually I just say `\G` matches the position where the previous match ended, and don't bother with the piddly details unless they're relevant to the current problem. I mean, what are the odds the string will start with `</a>`? I felt pretty safe with that one. ;) – Alan Moore Aug 21 '11 at 05:12
8

Try:

<?php

$r = 'Filed under: <a>Group1</a>, <a>Group2</a>, <a>Group3</a>, <a>Group4</a>';

if(preg_match_all("/<a.*?>([^<]*?)<\/a>/", $r, $matches)) {
    var_dump($matches[1]); 
}

?>

output:

array(4) {
  [0]=>
  string(6) "Group1"
  [1]=>
  string(6) "Group2"
  [2]=>
  string(6) "Group3"
  [3]=>
  string(6) "Group4"
}

EDIT:

Since you want to include the string 'Filed under' in the search to uniquely identify the match, you can try the two-step version below. I'm not sure whether it can be done with a single call to preg_match.

// Since you want to match everything after 'Filed under'
if (preg_match("/Filed under:(.*)$/", $r, $tail)) {
    // $tail[1] holds the text after the sentinel; collect each link's text from it
    if (preg_match_all("/<a.*?>([^<]*?)<\/a>/", $tail[1], $matches)) {
        var_dump($matches[1]);
    }
}
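If the real input spans several lines, the two steps could be wrapped up along these lines. This is only a sketch: the function name `extractFiledUnder` is hypothetical, and the `s` modifier (so that `.` can cross newlines) is an assumption about the input, not something the question states.

```php
<?php
// Hypothetical helper; the name and signature are illustrative only.
function extractFiledUnder(string $text): array
{
    // Step 1: keep only what follows the sentinel. The `s` modifier lets
    // `.` match newlines (an assumption, for multi-line input).
    if (!preg_match('/Filed under:(.*)$/s', $text, $tail)) {
        return [];
    }
    // Step 2: collect the text of every <a> element in that tail.
    preg_match_all('/<a[^>]*>([^<]*)<\/a>/', $tail[1], $links);
    return $links[1];
}

// Made-up sample: one link before the sentinel, two after it on separate lines.
$doc = "<a>Home</a>\nFiled under: <a>Group1</a>,\n<a>Group2</a>";
print_r(extractFiledUnder($doc));
```

Returning an empty array when the sentinel is missing keeps the caller from having to distinguish "no sentinel" from "no links".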
codaddict
  • Thanks, but I really need to use the "Filed under: " flag. While my example text was rudimentary, the actual file that I am parsing is quite complicated, and Filed under: is really the only unique identifier that I have to work with. Fortunately, it is at the end of the file, so I can match all the way to the end. – Senica Gonzalez Feb 05 '10 at 04:22
2
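$r = 'Filed under: <a>Group1</a>, <a>Group2</a>';
// Split on the closing tag; each non-empty piece ends with "<a>text"
$s = explode("</a>", $r);
foreach ($s as $k) {
    if ($k) {
        // The part after "<a>" is the link text
        $k = explode("<a>", $k);
        print "$k[1]\n";
    }
}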

output

$ php test.php
Group1
Group2
ghostdog74
1

I want the part of the regular expression inside the () to keep making matches, as designated by the +? at the end.

+? is a lazy quantifier - it will match as few times as possible. Since nothing follows it in your pattern, that means just once.

If you want to match several times, you want a greedy quantifier - +.

Also note that your regex doesn't quite work - the match fails as soon as it encounters the comma between the tags, because you haven't accounted for it. That likely needs correcting.
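To make the difference concrete, here is a small sketch comparing the two quantifiers. The patterns are a variation on the one in the question, with an optional comma added after each link; that adjustment is an assumption for the demo. Note that even the greedy version does not return every group: a capturing group that repeats keeps only its last iteration.

```php
<?php
$r = 'Filed under: <a>Group1</a>, <a>Group2</a>';

// Greedy +, allowing an optional comma and whitespace after each link.
$greedy = '/Filed under: (?:<a[^>]*>([^<]+)<\/a>(?:,\s*)?)+/';
preg_match($greedy, $r, $m);
// The group ran twice, but $m[1] keeps only the last iteration.
$lastOnly = $m[1];   // 'Group2'

// Same pattern with the lazy +?: it stops after a single iteration.
$lazy = '/Filed under: (?:<a[^>]*>([^<]+)<\/a>(?:,\s*)?)+?/';
preg_match($lazy, $r, $m);
$firstOnly = $m[1];  // 'Group1'

echo $lastOnly, "\n", $firstOnly, "\n";
```

This is why a repeated group alone cannot collect all the link texts, and why the other answers either match each link separately or use \G.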

Anon.
  • Right, I have tried with just the + quantifier. That fails also. I did also think about the comma, but I'm afraid I don't know how to handle it, since the second or third match may or may not have a comma. I did, however, try this as my attempt: [code] preg_match_all("/Filed under: (?:([\w|\d|\s]+?)<\/a>.*?)+/", $r, $matches); [/code] – Senica Gonzalez Feb 05 '10 at 04:15
  • Hmmm, comments don't look very pretty. – Senica Gonzalez Feb 05 '10 at 04:15
  • @Senica: you can use backticks to format code in comments just like you can in questions and answers, but if the code is long or complex, you should edit your question and put it there instead. The code you included above was a bit much for a comment. – Alan Moore Feb 05 '10 at 09:25
  • But @Anon. is right: a reluctant quantifier at the end of a regex almost never makes sense. If your regex had been correct otherwise, that final `?` would have broken it. – Alan Moore Feb 05 '10 at 09:26