
I'm struggling to extract content from an HTML string (stored in a DB). Each div is a chapter, and the h2 content is its title. I want to extract the title and the content of each chapter (div) separately.

<p>
<div>
   <h2>Title 1</h2>
   Chapter Content 1 with standard html tags (ex: the following tags)
   <strong>aaaaaaaa</strong><br />
   <em>aaaaaaaaa</em><br />
   <u>aaaaaaaa</u><br />
   <span style="color:#00ffff"></span><br />
</div>
<div>
   <h2>Title 2</h2>
   Chapter Content 2
</div>
...
</p>

I've tried preg_match_all in PHP, but it doesn't work when the content contains standard HTML tags.

function splitDescription($pDescr)
{
    $regex = "#<div.*?><h2.*?>(.*?)</h2>(.*?)</div>#";
    preg_match_all($regex, $pDescr, $result);

    return $result;
}
  • Using regex to parse HTML is a bad idea in itself; use an instance of DOMDocument to parse your HTML. – Marcus Recck Jul 19 '12 at 17:17
  • Have you heard of HTML parsers such as [DOMDocument](http://php.net/manual/en/class.domdocument.php) or [SimpleXml](http://php.net/manual/en/book.simplexml.php)? Also see http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Musa Jul 19 '12 at 17:20

2 Answers


Before you try to use regex to parse HTML, I recommend you read this post.

There are plenty of good XML/HTML parsers you can use.

Will

Don't use regex for this; it's not the correct tool for the job. Use an HTML parser such as PHP's DOMDocument:

libxml_use_internal_errors( true);
$doc = new DOMDocument;
$doc->loadHTML( $html);
$xpath = new DOMXPath( $doc);

// For each <div> chapter
foreach( $xpath->query( '//div') as $chapter) {

    // Get the <h2> and save its text content into $title
    $title_node = $xpath->query( 'h2', $chapter)->item( 0);
    if( $title_node === null) {
        continue; // Skip any <div> that has no <h2> child
    }
    $title = $title_node->textContent;

    // Remove the <h2>
    $chapter->removeChild( $title_node);

    // Save the rest of the <div> children in $content
    $content = '';
    foreach( $chapter->childNodes as $child) {
        $content .= $doc->saveHTML( $child);
    }
    echo "$title - " . htmlentities( $content) . "\n";
}
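
If you would rather get the chapters back as an array (as the `splitDescription()` function in the question does), the same approach could be wrapped roughly as follows. This is only a sketch under a few assumptions: every chapter is a `<div>` with a direct `<h2>` child, the input string is non-empty, and the array keys `title` and `content` are names picked here for illustration. The XPath `//div[h2]` skips any `<div>` that has no `<h2>` child.

```php
<?php
// Hypothetical helper: returns one ['title' => ..., 'content' => ...]
// entry per <div> that has a direct <h2> child.
function splitDescription( string $html): array
{
    // Collect libxml warnings internally instead of emitting them
    $previous = libxml_use_internal_errors( true);
    $doc = new DOMDocument;
    $doc->loadHTML( $html);
    libxml_clear_errors();
    libxml_use_internal_errors( $previous);

    $xpath = new DOMXPath( $doc);
    $chapters = [];

    // Only select <div> elements that contain an <h2> child
    foreach( $xpath->query( '//div[h2]') as $chapter) {
        $title_node = $xpath->query( 'h2', $chapter)->item( 0);
        $title = trim( $title_node->textContent);

        // Remove the <h2> so only the chapter body is left
        $chapter->removeChild( $title_node);

        // Serialize the remaining children back to HTML
        $content = '';
        foreach( $chapter->childNodes as $child) {
            $content .= $doc->saveHTML( $child);
        }
        $chapters[] = [ 'title' => $title, 'content' => trim( $content)];
    }
    return $chapters;
}

// Usage with made-up sample markup
$html = '<div><h2>Title 1</h2>Chapter <strong>one</strong><br /></div>'
      . '<div><h2>Title 2</h2>Chapter two</div>';

foreach( splitDescription( $html) as $chapter) {
    echo $chapter['title'] . ' - ' . htmlentities( $chapter['content']) . "\n";
}
```

Note that the content is re-serialized by libxml, so it may not match the stored string byte for byte (for example, `<br />` comes back as `<br>`).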


nickb