Identifying 2 groups of "match" with 1 run?

Question

I would like to find all opening and closing HTML tags in a text.

By that I mean tags matching the patterns <[a-z]+> and </[a-z]+> (ignoring tags that contain digits or attributes, and XHTML self-closing tags).

Currently I use two preg_match_all calls to get them both:

preg_match_all( '#<([a-z]+)>#i' , $html, $start, PREG_OFFSET_CAPTURE );
preg_match_all( '#<\/([a-z]+)>#i' , $html, $end, PREG_OFFSET_CAPTURE );

The first call puts the opening tags in the array $start, and the second puts the closing tags in $end.

Is there a way to get them with a single preg_match_all call? (I think a single call would be much faster.)

Thanks

Are you constrained to using regular expressions? A DOM parser will almost certainly be more expressive to use for a case like this. — Lightness Races in Orbit, Apr 20 '11 at 18:07

score 2 · Accepted Answer · answered Apr 20 '11 at 18:09

preg_match_all( '#</?([a-z]+)>#i' , $html, $start, PREG_OFFSET_CAPTURE );

will catch both opening and closing tags.
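
If the opening and closing tags still need to end up in two separate arrays, as in the question, one option is to capture the optional slash in its own group and split the matches afterwards. This is only a minimal sketch: the extra capture group, the sample `$html` string, and the `[name, offset]` layout of `$start` and `$end` are assumptions for illustration, not part of the answer above.

```php
<?php
// Hypothetical sample input, for illustration only.
$html = '<div><p>text</p><br></div>';

// Group 1 captures the optional slash, group 2 the tag name.
preg_match_all(
    '#<(/?)([a-z]+)>#i',
    $html,
    $matches,
    PREG_SET_ORDER | PREG_OFFSET_CAPTURE
);

$start = []; // opening tags as [name, offset of the whole tag]
$end   = []; // closing tags as [name, offset of the whole tag]

foreach ($matches as $m) {
    // With PREG_OFFSET_CAPTURE every group is a [text, offset] pair.
    $entry = [$m[2][0], $m[0][1]];
    if ($m[1][0] === '/') {
        $end[] = $entry;
    } else {
        $start[] = $entry;
    }
}

print_r($start);
print_r($end);
```

The subject is still scanned only once; the split into two arrays happens in plain PHP afterwards.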

Tim Pietzcker

Damn. Why didn't I think of `?` before? – dynamic Apr 20 '11 at 18:22

Erik · Answer 2 · 2011-04-20T18:19:28.573

1

Consider

preg_match_all( '#</?([a-z]+)>#i' , $html, $end, PREG_OFFSET_CAPTURE );

meaning that the / may be there or may not be there.

edited Apr 20 '11 at 18:19

answered Apr 20 '11 at 18:09

Erik

score 0 · Answer 3 · edited May 23 '17 at 10:27

Please read this answer to the general question of parsing HTML with regular expressions. It is the highest rated answer in the history of Stack Overflow. Then read some of the other answers in that thread for proper tools for doing this.

answered Apr 20 '11 at 18:12

Peter Rowell

A regex is just fine for this case – dynamic Apr 20 '11 at 18:23
Good luck then. The number of exceptions grows as you try to handle what can be between the tags. It's amazing how many people have to walk down this path before they learn this lesson. – Peter Rowell Apr 20 '11 at 18:30
Peter, who said anything about matching text between tags? The OP wants to match the tags only. Sure, tags can also be placed inside comments (in which case a regex won't work), but if you're aware of that, I see no problem with a bit of regex here. _I_ find it amazing how many people display this parrot-behavior of posting a link to some "famous" SO-question only because they see the word "regex" and "html" mentioned in 1 sentence. – Bart Kiers Apr 20 '11 at 19:22
I agree with Peter. @Bart and @yes123 see the update in my answer. – Wes Apr 20 '11 at 23:17
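
The caveat about comments raised in this thread can be checked with a small sketch. The sample `$html` string and the variable `$m` are made up for illustration; the pattern is the one from the answers above.

```php
<?php
// Hypothetical sample input: one real tag pair and one commented-out tag.
$html = '<p>visible</p><!-- <b> -->';

preg_match_all('#</?([a-z]+)>#i', $html, $m);

// The regex has no notion of HTML comments, so the commented-out <b>
// is reported alongside the real <p> and </p>.
print_r($m[0]);
```

Whether that matters depends on the input: if the text is known not to contain commented-out markup, the pattern does what the question asks.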

score -3 · Answer 4 · edited May 23 '17 at 12:26

Do not use regular expressions to parse HTML. Instead, see this for more info on good parsers that can get you what you need:

html parser for php
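
For comparison, here is a rough sketch of the parser route using PHP's bundled DOM extension. It assumes the extension is available, and the sample `$html` string and the `$names` array are made up for illustration. Note that it lists elements rather than the offsets of individual opening and closing tags, so it answers a somewhat different question from the regex.

```php
<?php
// Hypothetical sample input, for illustration only.
$html = '<html><body><p class="x">hello<br></p></body></html>';

$doc = new DOMDocument();
// Collect parser warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Walk every element in document order and record its tag name.
$names = [];
foreach ($doc->getElementsByTagName('*') as $element) {
    $names[] = $element->nodeName;
}

print_r($names);
```

Unlike the regex, this also picks up tags that carry attributes, such as the `<p class="x">` above.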

[UPDATE]

So, to my refuters who are so passionate about regexes: tell me how to interpret these results, and how exactly they match what @yes123 wanted?

<?php
$html = <<<HTML
<html>
<head>
<body a="asdf">
<br />
<p>
broken document

<br>

good luck with that
</body>
HTML;

preg_match_all( '#</?([a-z]+)>#i' , $html, $start, PREG_OFFSET_CAPTURE );

var_dump($start);
?>

which gives:

array(2) {
  [0]=>
  array(5) {
    [0]=>
    array(2) {
      [0]=>
      string(6) "<html>"
      [1]=>
      int(0)
    }
    [1]=>
    array(2) {
      [0]=>
      string(6) "<head>"
      [1]=>
      int(7)
    }
    [2]=>
    array(2) {
      [0]=>
      string(3) "<p>"
      [1]=>
      int(37)
    }
    [3]=>
    array(2) {
      [0]=>
      string(4) "<br>"
      [1]=>
      int(58)
    }
    [4]=>
    array(2) {
      [0]=>
      string(7) "</body>"
      [1]=>
      int(84)
    }
  }
  [1]=>
  array(5) {
    [0]=>
    array(2) {
      [0]=>
      string(4) "html"
      [1]=>
      int(1)
    }
    [1]=>
    array(2) {
      [0]=>
      string(4) "head"
      [1]=>
      int(8)
    }
    [2]=>
    array(2) {
      [0]=>
      string(1) "p"
      [1]=>
      int(38)
    }
    [3]=>
    array(2) {
      [0]=>
      string(2) "br"
      [1]=>
      int(59)
    }
    [4]=>
    array(2) {
      [0]=>
      string(4) "body"
      [1]=>
      int(86)
    }
  }
}

That preg_match works just as expected even if your HTML is broken. No problems at all there — dynamic, Apr 20 '11 at 23:27
Did you read my question? Do you get that the regex in this case gives exactly what I want? — dynamic, Apr 20 '11 at 23:34
