I need to split my HTML on a custom HTML tag.

This is what my HTML looks like:

<div>
    <div id="header">
        <h1>Document Title</h1>
    </div>

    <div id="content">
        <p>Lorem ipsum dolar sit</p>
        <magicheader type="2" class="someClass">Header</magicheader>
        <p>Lorem ipsum dolar sit</p>
        <span><magicheader type="3" class="someClass">Header</magicheader></span>
    </div>

    <div id="footer">

    </div>
</div>

This is what I need:

Array
(
    [0] => <div>
    <div id="header">
        <h1>Document Title</h1>
    </div>

    <div id="content">
        <p>Lorem ipsum dolar sit</p>
    [1] => <magicheader type="2" class="someClass">Header</magicheader>
    [2] => <p>Lorem ipsum dolar sit</p>
        <span>
    [3] => <magicheader type="3" class="someClass">Header</magicheader>
    [4] => </span>
    </div>

    <div id="footer">

    </div>
</div>
)

Can anybody help me with the pattern?

Arek van Schaijk
  • [Regex cannot parse HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – jbabey May 08 '13 at 12:51
  • There doesn't seem to be any pattern to the way you are splitting HTML. Can you explain the thinking behind the way you've mentioned the splitting works? – arijeet May 08 '13 at 12:51
  • It is wrong to say that Regex cannot chop up HTML, but quite accurate to say that Regex cannot reliably and accurately parse HTML. It is simply not a wise thing to do unless what you are attempting is a quick and dirty fix to one specific limited problem. Even then, there is usually a better/more appropriate solution. – Tim Radcliffe May 08 '13 at 13:05
  • Regex is not useful for parsing HTML; see the answer to this question: http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not – RMcLeod May 08 '13 at 12:51

1 Answer

Use preg_split with the PREG_SPLIT_DELIM_CAPTURE flag, and wrap the pattern in a capturing group so that the matched tags are kept in the result:

$text=<<<EOD
<div>
    <div id="header">
        <h1>Document Title</h1>
    </div>

    <div id="content">
        <p>Lorem ipsum dolar sit</p>
        <magicheader type="2" class="someClass">Header</magicheader>
        <p>Lorem ipsum dolar sit</p>
        <span><magicheader type="3" class="someClass">Header</magicheader></span>
    </div>

    <div id="footer">

    </div>
</div>
EOD;

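// The parentheses around the whole pattern matter: with PREG_SPLIT_DELIM_CAPTURE,
// preg_split() returns each captured delimiter (here, each <magicheader> element)
// as its own array item instead of discarding it.
// Note: this pattern only matches elements whose text is literally "Header".
$regexp = '%(<magicheader [^>]*>Header</magicheader>)%';
$value = preg_split($regexp, $text, -1, PREG_SPLIT_DELIM_CAPTURE);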

Then print_r($value) outputs:

Array
(
    [0] => <div>
    <div id="header">
        <h1>Document Title</h1>
    </div>

    <div id="content">
        <p>Lorem ipsum dolar sit</p>

    [1] => <magicheader type="2" class="someClass">Header</magicheader>
    [2] => 
        <p>Lorem ipsum dolar sit</p>
        <span>
    [3] => <magicheader type="3" class="someClass">Header</magicheader>
    [4] => </span>
    </div>

    <div id="footer">

    </div>
</div>
)
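
The pattern above matches only elements whose text is literally Header, and the pieces around each match keep their surrounding whitespace (see items [0] and [2]). A minimal sketch of a more general variant follows. It assumes PHP 7.4 or later, and the function name splitOnMagicHeader is hypothetical, not part of the original answer. It is still a regex, so it may well break on nested magicheader elements or on a ">" inside an attribute value, as the comments on the question warn.

```php
<?php
// Hypothetical helper (not from the original answer): splits $html around
// every <magicheader> element, whatever its attributes or inner text.
function splitOnMagicHeader(string $html): array
{
    // \b stops the pattern from matching longer tag names;
    // .*? with the "s" modifier matches the inner text lazily, newlines included;
    // the "i" modifier makes the tag name case-insensitive.
    $pattern = '%(<magicheader\b[^>]*>.*?</magicheader>)%si';

    $parts = preg_split($pattern, $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    if ($parts === false) {
        // preg_split() returns false when the regex engine reports an error.
        return [];
    }

    // Trim whitespace around each piece and drop pieces that end up empty.
    $parts = array_map('trim', $parts);
    return array_values(array_filter($parts, fn ($part) => $part !== ''));
}

$html = '<p>Intro</p> <magicheader type="2">First</magicheader> '
      . '<span><magicheader type="3">Second</magicheader></span>';

print_r(splitOnMagicHeader($html));
```

For this short input the result has five items: the leading paragraph, the first magicheader element, the opening span tag, the second magicheader element, and the closing span tag. Trimming is a design choice; skip the array_map step if the original whitespace has to be preserved so the pieces can be joined back into the exact source.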
Rui Jarimba
hegemon