I'm trying to parse some HTML that is submitted from CKEditor. The posted HTML looks like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">#012<html><body><p>Text Before <img alt="HAMBURGER" height="20" src="/sites/all/modules/ckeditor/plugins/apoji/images/emoji-E120.png" title="HAMBURGER" width="20"> Text After</p></body></html>
(The same markup reformatted for readability; whitespace may not match the original exactly):
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<p>
Text Before
<img alt="HAMBURGER" height="20" src="/sites/all/modules/ckeditor/plugins/apoji/images/emoji-E120.png" title="HAMBURGER" width="20">
Text After
</p>
</body>
</html>
I've been looking to use something like the below:
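$DOM = new DOMDocument;
$DOM->loadHTML($input);
// Initialise the output strings so that .= does not hit undefined variables
$sms = '';
$img_out = '';
$items = $DOM->getElementsByTagName('*');
foreach ($items as $item) {
    switch ($item->nodeName) {
        case "p":
            // nodeValue is the paragraph's text only; the <img> contributes nothing
            $sms .= $item->nodeValue."\n";
            break;
        case "img":
            $img_out .= "IMG Attr: ".$item->getAttribute('title')."\n";
            break;
    }
}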
My aim is to create a plain text string, replacing each image with its title, so I'd have a string like:
Text Before HAMBURGER Text After
I've started going down the DOM route, as it seems the best way to do it, but now I have two questions:
- If I loop over the document as above, the IMG ends up AFTER the text, not in the middle of it. How can I avoid this?
- What is the best way to extract all the plain text from the DOM document while keeping the order of the items (related to point 1)?
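To make the goal concrete, here is a rough sketch of the kind of in-order traversal I think I need. It walks the child nodes recursively instead of using getElementsByTagName('*'), so text nodes and images are visited in document order. The function name nodeToText is my own (it is not part of the DOM extension), the input is a shortened copy of the markup above without the DOCTYPE, and I have not confirmed that this is the best approach:

```php
<?php
// Hypothetical helper (not part of the DOM extension): visits a node's
// children in document order so text and <img> titles stay interleaved.
function nodeToText(DOMNode $node): string
{
    $out = '';
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMText) {
            // Text nodes contribute their content as-is.
            $out .= $child->nodeValue;
        } elseif ($child instanceof DOMElement && $child->nodeName === 'img') {
            // Replace the image with the value of its title attribute.
            $out .= $child->getAttribute('title');
        } elseif ($child->hasChildNodes()) {
            // Recurse into any other element (<body>, <p>, ...).
            $out .= nodeToText($child);
        }
    }
    return $out;
}

// Assumption: a shortened copy of the posted markup, used as sample input.
$input = '<html><body><p>Text Before <img alt="HAMBURGER" height="20" '
       . 'src="/sites/all/modules/ckeditor/plugins/apoji/images/emoji-E120.png" '
       . 'title="HAMBURGER" width="20"> Text After</p></body></html>';

$DOM = new DOMDocument;
$DOM->loadHTML($input);

$text = trim(nodeToText($DOM->documentElement));
echo $text, "\n"; // prints "Text Before HAMBURGER Text After"
```

Unlike my loop above, this does not append "\n" after each paragraph; that would presumably need an extra branch for "p" elements.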
Thanks in advance to anyone who can give me some input on this.