2

I need to take a string of HTML like:

<p>This is a line with no spans<br>
This is a line <span class="second">This is secondary</span><br>  
This is another line <span class="third">And this is third</span> <span class="four">this is four</span></p>

And have it end up as an array in PHP like:

array(
    "This is a line with no spans",
    array(
        "This is a line",
        "second" => "This is secondary",
    ),
    array(
        "This is another line",
        "third" => "And this is third",
        "four" => "this is four"
    )
);

Getting each line into its own value was easy: I just split the text on <br>, and that works fine. What I can't quite get is splitting each line further so that the span contents are keyed by class name. I feel like PHP's preg_split may hold the key, but I'm not good with regular expressions and I can't figure it out.

Any ideas?

aron.duby

3 Answers

3

You should not attempt to parse HTML with regex or other ad hoc string handling. It gets complicated very quickly and leaves you with code that is painful to maintain.

I highly recommend you look into how to read a chunk of markup into a DOM document [docs] and then use DOM methods to work with it, just as you would in the browser.
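As a rough illustration of that approach, the sketch below loads a shortened version of the question's markup with DOMDocument::loadHTML() and lists the paragraph's direct children. The variable names and the shortened sample string are assumptions made for the example; error handling is left out.

```php
<?php
// Shortened sample markup based on the question.
$html = '<p>This is a line with no spans<br>'
      . 'This is a line <span class="second">This is secondary</span></p>';

$doc = new DOMDocument();
// loadHTML() parses the fragment and wraps it in <html><body>.
$doc->loadHTML($html);

// Locate the first <p> element in the parsed document.
$p = $doc->getElementsByTagName('p')->item(0);

// Collect the node names of the paragraph's direct children,
// text nodes and elements alike.
$names = array();
foreach ($p->childNodes as $child) {
    $names[] = $child->nodeName;
}
echo implode(', ', $names), "\n"; // prints "#text, br, #text, span"
```

Text nodes report the name #text, so the <br> elements can be told apart from the text around them without any string splitting.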

JAAulde
  • I've been using DOMDocument to get as far as the p tags, but I couldn't figure out a way to split on the line breaks without it becoming text. – aron.duby Aug 13 '11 at 22:16
  • I wouldn't split on line breaks. Walk the nodes, checking their type and name (Do I have a text node? Do I have a BR element?), and make decisions with that info. – JAAulde Aug 13 '11 at 22:34
  • I could have sworn I had tried that and it didn't work, but it did this time. Thanks man! – aron.duby Aug 13 '11 at 23:01
1

Maybe you can use an XML parser? Here's the doc.

Cydonia7
1

It's not a good idea to use regular expressions to parse HTML (cite). It's just not a suitable tool; see @JAAulde's answer.

The best way is to do it purely with the DOM. Loop through all of the paragraph's child nodes (including text nodes) and build the array as you go, like this:

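// $p = get paragraph tag... (a DOMElement for the <p>)
$lines = array();
$line = array(); // must live outside the loop so it survives between nodes
foreach ($p->childNodes as $child) {
    if ($child instanceof DOMText) {
        // Skip whitespace-only text nodes, e.g. the space between two spans
        $text = trim($child->wholeText);
        if ($text !== '') {
            $line[] = $text;
        }
    } elseif ($child instanceof DOMElement) {
        $tag = strtolower($child->tagName);
        if ($tag == 'br') {
            // A <br> ends the current line
            $lines[] = $line;
            $line = array();
        } elseif ($tag == 'span' && $child->hasAttribute('class')) {
            // Key the span's text by its class name
            $line[$child->getAttribute('class')] = $child->nodeValue;
        }
    }
}
// The last line has no trailing <br>, so add it after the loop
if ($line) {
    $lines[] = $line;
}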

Warning: treat the above as pseudo-code. It has not been tested at all; I'm just going from experience and the manual.
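For an end-to-end picture, the node walk can be combined with the step of loading the markup. The following is only a sketch under a few assumptions: the function name paragraphToLines() is hypothetical, only the first <p> is processed, empty lines are dropped, and a line that contains nothing but plain text is collapsed to a string so the result has the shape asked for in the question.

```php
<?php
/**
 * Hypothetical helper (name and signature are illustrative only):
 * converts the first <p> in $html into an array of lines.
 */
function paragraphToLines($html)
{
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $p = $doc->getElementsByTagName('p')->item(0);
    if ($p === null) {
        return array();
    }

    $lines = array();
    $line = array();
    // Store the finished line; a line holding only plain text becomes a string.
    $flush = function () use (&$lines, &$line) {
        if (count($line) === 1 && isset($line[0])) {
            $lines[] = $line[0];
        } elseif ($line) {
            $lines[] = $line;
        }
        $line = array();
    };

    foreach ($p->childNodes as $child) {
        if ($child instanceof DOMText) {
            // Ignore whitespace-only text, e.g. the gap between two spans.
            $text = trim($child->nodeValue);
            if ($text !== '') {
                $line[] = $text;
            }
        } elseif ($child instanceof DOMElement) {
            $tag = strtolower($child->tagName);
            if ($tag === 'br') {
                $flush(); // a <br> ends the current line
            } elseif ($tag === 'span' && $child->hasAttribute('class')) {
                $line[$child->getAttribute('class')] = $child->nodeValue;
            }
        }
    }
    $flush(); // the last line has no trailing <br>

    return $lines;
}

// Usage with the markup from the question.
$sample = "<p>This is a line with no spans<br>\n"
        . "This is a line <span class=\"second\">This is secondary</span><br>\n"
        . "This is another line <span class=\"third\">And this is third</span> "
        . "<span class=\"four\">this is four</span></p>";
$result = paragraphToLines($sample);
print_r($result);
```

Collapsing text-only lines to a string is a choice made here to match the question's example; keeping every line as an array would make the result more uniform to consume.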

Jonah
  • I just finished writing this and came back and saw your answer. Almost identical. – aron.duby Aug 13 '11 at 23:00
  • For those who come along later with the same question, I do _not_ dispute this being the correct answer. However, it is important to point out that the missing step to get from what the OP has to what was accepted as an answer was the reading in of the markup to a PHP DOM Document. See my answer for links to docs on that. – JAAulde Aug 14 '11 at 00:35
  • @JAAulde: excellent point, I'll allude to that and refer to your answer. – Jonah Aug 14 '11 at 01:28