Notice those responses about not using regex? Why is that? Well, that's because HTML represents structure. Though to be honest, that HTML code overuses divs instead of using semantic markup, but I'm going to parse it anyway with DOM functions. So then, here's the sample HTML I used:
<html>
<body>
<!-- message -->
<div>
Just the text.
</div>
<!-- / message -->
<!-- message -->
<div>
<div style="margin-left: 20px; margin-top:5px; ">
<div class="smallfont">Quote:</div>
</div>
<div style="margin-right: 20px; margin-left: 20px; padding: 10px;">
Message from <strong>Nickname</strong>
<div style="font-style:italic">Hello. It's a quote</div>
</div>
<br /><br />
It's the simple text
</div>
<!-- / message -->
<!-- message -->
<div>
Text<br />
<div style="margin:20px; margin-top:5px; background-color: #30333D">
<div class="smallfont" style="margin-bottom:2px">PHP code:</div>
<div class="alt2" style="margin:0px; padding:6px; border:1px inset; width:640px; height:482px; overflow:auto; background-color:#FFFACA;">
<code style="white-space:nowrap">
<div dir="ltr" style="text-align:left">
<!-- php buffer start -->
<code>
LALALA PHP CODE
</code>
<!-- php buffer end -->
</div>
</code>
</div>
</div><br />
<br />
More text
</div>
<!-- / message -->
</body>
</html>
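To make "HTML represents structure" a bit more concrete, here is a minimal sketch of what the DOM sees inside a single message. It uses a cut-down, single-line stand-in for the third message rather than the full file, and the variable names are only for illustration:

```php
<?php
// Cut-down stand-in for one message; not the full test.html.
$html = '<html><body><div>Text<br /><div>PHP code:</div>More text</div></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// The first top-level div under <body> is the message.
$message = $xpath->query('//body/div')->item(0);

$kinds = array();
foreach ($message->childNodes as $child) {
    // Skip text nodes that contain only whitespace.
    if ($child->nodeType === XML_TEXT_NODE && trim($child->nodeValue) === '') {
        continue;
    }
    $kinds[] = $child->nodeName;
}

echo implode(', ', $kinds), "\n"; // #text, br, div, #text
```

Plain text shows up as `#text` nodes while the special blocks show up as `div` elements, which is exactly the kind of distinction a parser can branch on.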
Now for the full code:
$doc = new DOMDocument();
$doc->loadHTMLFile('test.html');

// These just make the code nicer
// We could just inline them if we wanted to
// ----------- Helper Functions ------------
function HasQuote($part, $xpath) {
    // Check whether this div has a child div containing "Quote:"
    return $xpath->query("div[contains(.,'Quote:')]", $part)->length > 0;
}

function HasPHPCode($part, $xpath) {
    // Check whether this div has a child div containing "PHP code:"
    return $xpath->query("div[contains(.,'PHP code:')]", $part)->length > 0;
}
// ----------- End Helper Functions ------------

// ----------- Parse Functions ------------
function ParseQuote($quote, $xpath) {
    // The quote content is actually the next
    // next div over. Man this markup is weird.
    // (The first nextSibling is the whitespace text node between the divs.)
    $quote = $quote->nextSibling->nextSibling;
    $quote_info = array('type' => 'quote');

    $nickname = $xpath->query("strong", $quote);
    if ($nickname->length) {
        $quote_info['nickname'] = $nickname->item(0)->nodeValue;
    }

    $quote_text = $xpath->query("div", $quote);
    if ($quote_text->length) {
        $quote_info['quote_text'] = trim($quote_text->item(0)->nodeValue);
    }
    return $quote_info;
}

function ParseCode($code, $xpath) {
    $code_info = array('type' => 'code');
    // This matches the path to get down to the innermost code element.
    // The leading "." keeps the search inside $code; a bare "//" would
    // search the whole document and ignore the context node.
    $code_text = $xpath->query(".//div/code/div/code", $code);
    if ($code_text->length) {
        $code_info['code_text'] = trim($code_text->item(0)->nodeValue);
    }
    return $code_info;
}
// ----------- End Parse Functions ------------

function GetMessages($message, $xpath) {
    $message_contents = array();
    foreach ($message->childNodes as $child) {
        // So inside of a message if we hit a div
        // we either have a Quote or PHP code, check which
        if (strtolower($child->nodeName) == 'div') {
            if (HasQuote($child, $xpath)) {
                $quote = ParseQuote($child, $xpath);
                // quote_text is only set when a quote body was found
                if (!empty($quote['quote_text'])) {
                    $message_contents[] = $quote;
                }
            }
            else if (HasPHPCode($child, $xpath)) {
                $phpcode = ParseCode($child, $xpath);
                // code_text is only set when a code element was found
                if (!empty($phpcode['code_text'])) {
                    $message_contents[] = $phpcode;
                }
            }
        }
        // Otherwise check if we've found some plain text
        else if ($child->nodeType == XML_TEXT_NODE) {
            // This might be just whitespace, so check that it's not empty
            $text = trim($child->nodeValue);
            if ($text !== '') {
                $message_contents[] = array('type' => 'text', 'text' => $text);
            }
        }
    }
    return $message_contents;
}

$xpath = new DOMXPath($doc);

// We need to get the toplevel divs, which
// are the messages
$toplevel_divs = $xpath->query("//body/div");
$messages = array();
foreach ($toplevel_divs as $toplevel_div) {
    $messages[] = GetMessages($toplevel_div, $xpath);
}
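One XPath detail matters whenever a context node is passed to query(), as the helper and parse functions above do: a path that begins with `//` is evaluated from the document root no matter which context node is given, while `.//` searches only beneath the context node. Here is a minimal sketch of the difference, using a made-up two-block document whose ids exist only for this illustration:

```php
<?php
// Hypothetical two-block document; the ids exist only for this illustration.
$html = '<html><body>'
      . '<div id="a"><code>first</code></div>'
      . '<div id="b"><code>second</code></div>'
      . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Use the second top-level div as the context node.
$second = $xpath->query('//body/div')->item(1);

// "//code" starts at the document root, so the context node is ignored.
$absolute = $xpath->query('//code', $second)->item(0)->nodeValue;

// ".//code" starts at the context node and only looks beneath it.
$relative = $xpath->query('.//code', $second)->item(0)->nodeValue;

echo $absolute, "\n"; // first
echo $relative, "\n"; // second
```

With only one code block in the sample HTML the two forms would happen to give the same answer, but with several code blocks on a page the relative form is the one that stays inside the current message.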
Now let's see what $messages looks like:
Array
(
[0] => Array
(
[0] => Array
(
[type] => text
[text] => Just the text.
)
)
[1] => Array
(
[0] => Array
(
[type] => quote
[nickname] => Nickname
[quote_text] => Hello. It's a quote
)
[1] => Array
(
[type] => text
[text] => It's the simple text
)
)
[2] => Array
(
[0] => Array
(
[type] => text
[text] => Text
)
[1] => Array
(
[type] => code
[code_text] => LALALA PHP CODE
)
[2] => Array
(
[type] => text
[text] => More text
)
)
)
It's separated by message, and each message is further separated into its different pieces of content! Now we can even use a basic print function like this:
foreach ($messages as $message) {
    echo "\n\n>>>>>> Message >>>>>>>\n";
    foreach ($message as $content) {
        if ($content['type'] == 'text') {
            echo "{$content['text']} ";
        }
        else if ($content['type'] == 'quote') {
            echo "\n\n======== Quote =========\n";
            echo "From: {$content['nickname']}\n\n";
            echo "{$content['quote_text']}\n";
            echo "=====================\n\n";
        }
        else if ($content['type'] == 'code') {
            echo "\n\n======== Code =========\n";
            echo "{$content['code_text']}\n";
            echo "=====================\n\n";
        }
    }
}
echo "\n";
And we get this!
>>>>>> Message >>>>>>>
Just the text.
>>>>>> Message >>>>>>>
======== Quote =========
From: Nickname
Hello. It's a quote
=====================
It's the simple text
>>>>>> Message >>>>>>>
Text
======== Code =========
LALALA PHP CODE
=====================
More text
This all works, once again, because the DOM parsing functions are able to understand structure.
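One fragile spot is the double nextSibling in ParseQuote, which assumes exactly one node (presumably the whitespace text node) sits between the "Quote:" div and the quote body. A minimal sketch of a more tolerant approach follows, assuming a hypothetical helper called NextElement that is not part of the code above:

```php
<?php
// Hypothetical helper (not in the original code): return the next sibling
// that is an element, skipping text nodes and comments, or null if none.
function NextElement(DOMNode $node) {
    for ($n = $node->nextSibling; $n !== null; $n = $n->nextSibling) {
        if ($n->nodeType === XML_ELEMENT_NODE) {
            return $n;
        }
    }
    return null;
}

// Cut-down quote markup with a newline and a comment between the two divs.
$html = "<html><body><div><div>Quote:</div>\n<!-- gap --><div>Hello</div></div></body></html>";

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// The first inner div is the "Quote:" label; the quote body follows it.
$label = $xpath->query('//body/div/div')->item(0);
$body  = NextElement($label);

echo $body->nodeValue, "\n"; // Hello
```

Swapping something like this in for `$quote->nextSibling->nextSibling` would keep the parse working even if the forum software changed the whitespace or comments between those divs.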