I am using the following pattern to match HTML tags:

~<([[:alpha:]]+) ([[:alpha:]]+=".*?")*>.*?</\1>~si

It works fine and can match any tag, but across the whole string it only finds tags that look like the first match it encounters. For example:

$text = <<<text
<p class="matches">some text, this will match</p>
<p>this won't match</p>
<p>this won't match either</p>
<p class="matches">this will match</p>
<p class="matches">this will match too</p>
<div>This won't match either but I want it to..</div>
text;

$pattern = '~<([[:alpha:]]+) ([[:alpha:]]+=".*?")*>.*?</\1>~si';
preg_match_all($pattern,$text,$matches);
var_dump($matches);

The code posted will fill $matches as I want it to, but $matches[0][*] will only contain the 3 paragraphs that have the class="matches" attribute (I tested this pattern on tags without attributes and it does match those properly too). Regexp is not my forte... What am I doing wrong?

Yoshi

1 Answer

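Add \s? between your element and attribute match. Your pattern requires a literal space after the tag name, so tags without attributes, such as <p> or <div>, never match: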

~<([[:alpha:]]+)\s?([[:alpha:]]+=".*?")*>.*?</\1>~si

Also, you shouldn't be using regex for HTML.
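
To see the difference, here is a minimal sketch that runs both patterns against the sample text from the question and compares the match counts. The variable names ($original, $fixed, $originalCount, $fixedCount) are only illustrative and not part of the original code.

```php
<?php
// Sample input copied from the question.
$text = <<<TEXT
<p class="matches">some text, this will match</p>
<p>this won't match</p>
<p>this won't match either</p>
<p class="matches">this will match</p>
<p class="matches">this will match too</p>
<div>This won't match either but I want it to..</div>
TEXT;

// Pattern from the question: the literal space after the tag name is mandatory.
$original = '~<([[:alpha:]]+) ([[:alpha:]]+=".*?")*>.*?</\1>~si';
// Suggested pattern: \s? makes that whitespace optional.
$fixed = '~<([[:alpha:]]+)\s?([[:alpha:]]+=".*?")*>.*?</\1>~si';

// preg_match_all() returns the number of full pattern matches.
$originalCount = preg_match_all($original, $text, $originalMatches);
$fixedCount = preg_match_all($fixed, $text, $fixedMatches);

echo "original: $originalCount, fixed: $fixedCount\n";
// prints "original: 3, fixed: 6"
```

With the optional whitespace, the attribute-less paragraphs and the div are picked up as well. This only shows the fix on this particular input; it does not make the pattern robust against nested tags or arbitrary HTML.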

Justin Johnson
  • Thanks. Now I can see where I've screwed up. Now I am using: `~<([[:alpha:]]+)(\s*[[:alpha:]]+=".*?")*>.*?\1>~si` That works happily. I really don't see what is wrong with using regexp to match tags. I'm only using it to find open paragraphs, lists, divs, etc. to summarise a piece of HTML text down to a few paragraphs. I know of potential problems with this, like not matching things like
    but that really doesn't matter in this case.
    – Yoshi Apr 27 '11 at 03:39
  • Oh god. I just actually used my brain for once... Why didn't I just use | and search for a limited set of 6 or so tag names instead of searching for generic tags? Because I'm an idiot, that's why! – Yoshi Apr 27 '11 at 03:52
  • @Yoshi, the reason you shouldn't use regex is because HTML is far more complicated than you realize, and when you discover how complex it is you will find regex inadequate. For instance, how does your regex fare with this **perfectly valid** HTML 4.01 Strict fragment: `

    first part of the text> second part`?

    – eyelidlessness May 21 '12 at 21:22