0

I would like to extract all opening and closing HTML tags from a text.

I mean tags matching the patterns <[a-z]+> and </[a-z]+> (ignoring tags with digits in the name, tags with attributes, and XHTML self-closing tags).

Currently I use two preg_match_all calls to get them both:

preg_match_all( '#<([a-z]+)>#i' , $html, $start, PREG_OFFSET_CAPTURE );
preg_match_all( '#<\/([a-z]+)>#i' , $html, $end, PREG_OFFSET_CAPTURE );

The first puts the opening tags into the array $start and the second puts the closing tags into $end.

Is there a way to get them with a single call to preg_match_all? (I assume a single call would be faster.)

Thanks

Lightness Races in Orbit

4 Answers

2
preg_match_all( '#</?([a-z]+)>#i' , $html, $start, PREG_OFFSET_CAPTURE );

This will match both opening and closing tags.
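
If separate lists of opening and closing tags are still needed, one possible approach is to capture the optional slash in its own group and split the matches afterwards. The following is only a sketch and not part of the original answer; the helper name `splitTags` and the `name`/`offset` keys are hypothetical, and it assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper: a single preg_match_all call, then a split on the captured slash.
function splitTags(string $html): array
{
    $open = [];
    $close = [];
    // Group 1 captures the optional "/", group 2 captures the tag name.
    preg_match_all(
        '#<(/?)([a-z]+)>#i',
        $html,
        $matches,
        PREG_SET_ORDER | PREG_OFFSET_CAPTURE
    );
    foreach ($matches as $m) {
        // With PREG_OFFSET_CAPTURE every entry is [matched text, byte offset].
        $tag = ['name' => $m[2][0], 'offset' => $m[0][1]];
        if ($m[1][0] === '/') {
            $close[] = $tag;
        } else {
            $open[] = $tag;
        }
    }
    return [$open, $close];
}

[$open, $close] = splitTags('<p>Hello <b>world</b></p>');
```

Here `$open` holds the `p` and `b` opening tags and `$close` holds the `b` and `p` closing tags, each with the offset of the full tag in the input string.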

Tim Pietzcker
1

Consider

preg_match_all( '#</?([a-z]+)>#i' , $html, $end, PREG_OFFSET_CAPTURE );

The ? makes the preceding / optional, so the slash may or may not be present.
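
To see what the optional slash does and does not cover, a small check along these lines may help. The sample strings and the `$results` array are made up for illustration.

```php
<?php
// The "?" after "/" makes the slash optional, so one pattern covers both tag forms.
$pattern = '#</?([a-z]+)>#i';

$samples = ['<b>', '</b>', '<br />', '<a href="#">', '<h1>'];
$results = [];
foreach ($samples as $sample) {
    // preg_match returns 1 when the pattern matches and 0 when it does not.
    $results[$sample] = preg_match($pattern, $sample) === 1;
}
```

Only `<b>` and `</b>` are reported as matches. The self-closing tag, the tag with an attribute, and the tag with a digit in its name are all skipped, which is what the question asked for.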

Erik
0

Please read this answer to the general question of parsing HTML with regular expressions. It is the highest rated answer in the history of Stack Overflow. Then read some of the other answers in that thread for proper tools for doing this.

Peter Rowell
  • Regexes are just fine for this case. – dynamic Apr 20 '11 at 18:23
  • Good luck then. The number of exceptions grows as you try to handle what can be between the tags. It's amazing how many people have to walk down this path before they learn this lesson. – Peter Rowell Apr 20 '11 at 18:30
  • Peter, who said anything about matching text between tags? The OP wants to match the tags only. Sure, tags can also be placed inside comments (in which case a regex won't work), but if you're aware of that, I see no problem with a bit of regex here. _I_ find it amazing how many people display this parrot-behavior of posting a link to some "famous" SO-question only because they see the word "regex" and "html" mentioned in 1 sentence. – Bart Kiers Apr 20 '11 at 19:22
  • I agree with Peter. @Bart and @yes123 see the update in my answer. – Wes Apr 20 '11 at 23:17
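
The caveat about comments raised in the discussion above can be reproduced with a short snippet. This is only an illustration with a made-up input string, not code from any of the commenters.

```php
<?php
// Made-up input: one real paragraph plus markup hidden inside an HTML comment.
$html = "<p>visible</p>\n<!-- <b>commented out</b> -->";

preg_match_all('#</?([a-z]+)>#i', $html, $matches);

// $matches[0] lists every full match, including the tags inside the comment,
// because the pattern has no notion of comment boundaries.
$found = $matches[0];
```

The pattern reports `<b>` and `</b>` as well, even though a browser would never treat them as elements.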
-3

Do not use regular expressions to parse HTML. Instead, see the following for more information on good parsers that can get you what you need:

html parser for php
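
As a rough idea of what the parser route can look like, here is a sketch using PHP's bundled DOM extension (`DOMDocument`), which is an assumption on my part and not necessarily the parser the linked page recommends. Note that a DOM parser reports elements rather than individual opening and closing tags with their offsets, so it answers a slightly different question than the regex does.

```php
<?php
// Assumes the DOM extension (ext-dom) is enabled; the sample markup is made up.
$html = '<html><body><p>Hello <b>world</b></p><br></body></html>';

// Collect libxml parse warnings internally instead of emitting them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Walk every element in document order and record its tag name.
$names = [];
foreach ($doc->getElementsByTagName('*') as $element) {
    $names[] = $element->nodeName;
}
```

After this runs, `$names` holds one entry per element the parser found, regardless of whether the source had attributes, self-closing syntax, or missing closing tags.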

[UPDATE]

To my refuters who are so passionate about regexes: tell me how to interpret these results, and how exactly they match what @yes123 wanted.

<?php
$html = <<<HTML
<html>
<head>
<body a="asdf">
<br />
<p>
broken document

<br>

good luck with that
</body>
HTML;

preg_match_all( '#</?([a-z]+)>#i' , $html, $start, PREG_OFFSET_CAPTURE );

var_dump($start);
?>

which gives:

array(2) {
  [0]=>
  array(5) {
    [0]=>
    array(2) {
      [0]=>
      string(6) "<html>"
      [1]=>
      int(0)
    }
    [1]=>
    array(2) {
      [0]=>
      string(6) "<head>"
      [1]=>
      int(7)
    }
    [2]=>
    array(2) {
      [0]=>
      string(3) "<p>"
      [1]=>
      int(37)
    }
    [3]=>
    array(2) {
      [0]=>
      string(4) "<br>"
      [1]=>
      int(58)
    }
    [4]=>
    array(2) {
      [0]=>
      string(7) "</body>"
      [1]=>
      int(84)
    }
  }
  [1]=>
  array(5) {
    [0]=>
    array(2) {
      [0]=>
      string(4) "html"
      [1]=>
      int(1)
    }
    [1]=>
    array(2) {
      [0]=>
      string(4) "head"
      [1]=>
      int(8)
    }
    [2]=>
    array(2) {
      [0]=>
      string(1) "p"
      [1]=>
      int(38)
    }
    [3]=>
    array(2) {
      [0]=>
      string(2) "br"
      [1]=>
      int(59)
    }
    [4]=>
    array(2) {
      [0]=>
      string(4) "body"
      [1]=>
      int(86)
    }
  }
}
Wes