
I'm trying to parse HTML that is submitted from CKEditor. The posted HTML looks like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">#012<html><body><p>Text Before <img alt="HAMBURGER" height="20" src="/sites/all/modules/ckeditor/plugins/apoji/images/emoji-E120.png" title="HAMBURGER" width="20"> Text After</p></body></html>

(Formatted for readability; not guaranteed to be identical to the above):

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
    <body>
        <p>
            Text Before
            <img alt="HAMBURGER" height="20" src="/sites/all/modules/ckeditor/plugins/apoji/images/emoji-E120.png" title="HAMBURGER" width="20">
            Text After
        </p>
    </body>
</html>

I've been trying something like the following:

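$DOM = new DOMDocument;
$DOM->loadHTML($input);

// Accumulators must exist before the first .= to avoid undefined-variable notices.
$sms = '';
$img_out = '';

$items = $DOM->getElementsByTagName('*');
foreach ($items as $item) {
    switch ($item->nodeName) {
    case "p":
        // nodeValue returns the text of all descendants of <p>.
        $sms .= $item->nodeValue."\n";
        break;
    case "img":
        $img_out .= "IMG Attr: ".$item->getAttribute('title')."\n";
        break;
    }
}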

My aim is to create a plain-text string, replacing each image with its title, so I'd have a string like:

Text Before HAMBURGER Text After

I've started going down the DOM route, as it seems the best way to do it, but now I have two questions:

  1. If I loop over the document as above the IMG ends up AFTER the text, not in the middle of it. How could I avoid this?
  2. What is the best way to extract all the plain text from the DOM document while keeping the order of items (related to point 1)?

Thanks in advance to anyone who can give me some input on this.

Zul
Alex
  • Are you able to use JavaScript? jQuery handles this quite easily, and you could then submit that over AJAX. – Dan Blows Feb 13 '12 at 11:15
  • So your real question is "How to replace an IMG element with its title attribute", right? – Gordon Feb 13 '12 at 11:31
  • possible duplicate of [PHP or Javascript: Simply Remove and Replace HTML Code](http://stackoverflow.com/questions/3555597/php-or-javascript-simply-remove-and-replace-html-code) – Gordon Feb 13 '12 at 11:32
  • Is your markup always going to be that simple or are there more complicated cases? – Salman A Feb 13 '12 at 11:52

3 Answers

2

My aim is to create a plain-text string, replacing each image with its title, so I'd have a string like:

Text Before HAMBURGER Text After

One option is to use an XPath query that selects the text nodes and the image title attributes you want, then output their values in the order the query returns them (document order).

$html = '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"><html><body><p>Text Before<img alt="HAMBURGER" height="20" src="/sites/all/modules/ckeditor/plugins/apoji/images/emoji-E120.png" title="HAMBURGER" width="20">Text After</p></body></html>';

$doc = new DOMDocument;
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
$nodes = $xpath->query('/html/body//text() | /html/body//img/@title');

$text = '';
foreach ($nodes as $node) {
    $text .= $node->nodeValue . ' ';
}

echo $text; // Text Before HAMBURGER Text After 
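
If the input is indented like the formatted version in the question, the text nodes carry newlines and leading spaces. The following is a minimal sketch of the same XPath approach with whitespace cleanup added. The function name `htmlToPlainText`, the shortened sample markup, and the use of `libxml_use_internal_errors()` to quiet parser warnings are assumptions made for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper: collects text nodes and img title attributes
// with the same XPath union as above, then tidies the whitespace.
function htmlToPlainText(string $html): string
{
    $doc = new DOMDocument;
    // Assumption for this sketch: suppress warnings from imperfect markup.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('/html/body//text() | /html/body//img/@title');

    $parts = [];
    foreach ($nodes as $node) {
        // Trim each piece and skip whitespace-only text nodes.
        $piece = trim($node->nodeValue);
        if ($piece !== '') {
            $parts[] = $piece;
        }
    }

    // Collapse remaining runs of whitespace (indentation, newlines) to one space.
    return preg_replace('/\s+/', ' ', implode(' ', $parts));
}

// Shortened stand-in for the question's formatted markup.
$html = '<html><body><p>
    Text Before
    <img alt="HAMBURGER" src="emoji-E120.png" title="HAMBURGER">
    Text After
</p></body></html>';

echo htmlToPlainText($html), "\n";
```

One trade-off to be aware of: joining the pieces with a single space also inserts a space where inline markup splits a word (for example `foo<b>bar</b>`), so this may need adjusting for richer editor output.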
salathe
  • Thanks for that @salathe, I like this solution! In the end I used a line like: `$newitem = new DOMElement('div', $item->getAttribute('title')); $item->parentNode->replaceChild($newitem, $item);` As later in the code I'm using: `$html = $DOM->saveHTML(); $html = substr(strip_tags($html), 1);` (Yes not ideal) But I think your method will give me a much neater solution, thanks a lot! – Alex Feb 14 '12 at 12:25
1

You can use XPath to find specific items and then replace them with new nodes.

E.g.

<?php
foreach( range(0,2) as $i ) {
    $doc = new DOMDocument;
    $doc->loadHTML( getData($i) );
    foo($doc);
}


function foo(DOMDocument $doc) {
    $xpath = new DOMXPath($doc);
    foreach( $xpath->query('//p/img') as $img ) {
        $alt = $img->getAttribute('alt');

        $img->parentNode->replaceChild(
            $doc->createTextNode($alt),
            $img
        );
    }
    echo "\n---\n", $doc->saveHTML(), "\n---\n";
}



function getData($i) {
    $rv = null;
    switch($i) {
        case 0: $rv = '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"><html><body><p>Text Before <img alt="HAMBURGER" height="20" src="/sites/all/modules/ckeditor/plugins/apoji/images/emoji-E120.png" title="HAMBURGER" width="20"> Text After</p></body></html>'; break;
        case 1: $rv = '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
            <html>
                <body>
                    <p>
                        Text Before <img alt="HAMBURGER" height="20" src="/sites/all/modules/ckeditor/plugins/apoji/images/emoji-E120.png" title="HAMBURGER" width="20">
                        Text After
                    </p>
                </body>
            </html>';
            break;
        case 2: $rv = '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
            <html>
                <body>
                    <p>
                        Text Before <img alt="HAMBURGER" height="20" src="/sites/all/modules/ckeditor/plugins/apoji/images/emoji-E120.png" title="HAMBURGER" width="20">
                        Text After
                    </p>
                    <p>
                        Text Before <img alt="HAMBURGER2" height="20" src="/sites/all/modules/ckeditor/plugins/apoji/images/emoji-E120.png" title="HAMBURGER" width="20">
                        Text After
                    </p>
                    <p>
                        Text Before <img alt="HAMBURGER3" height="20" src="/sites/all/modules/ckeditor/plugins/apoji/images/emoji-E120.png" title="HAMBURGER" width="20">
                        Text After
                    </p>
                </body>
            </html>';
            break;
    }   
    return $rv; 
}

prints

---
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>Text Before HAMBURGER Text After</p></body></html>

---

---
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
                    <p>
                        Text Before HAMBURGER
                        Text After
                    </p>
                </body></html>

---

---
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
                    <p>
                        Text Before HAMBURGER
                        Text After
                    </p>
                    <p>
                        Text Before HAMBURGER2
                        Text After
                    </p>
                    <p>
                        Text Before HAMBURGER3
                        Text After
                    </p>
                </body></html>

---

For your question #2, please elaborate. It can be as simple as echo $doc->documentElement->textContent, but it could also end up requiring XSL(T).
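
To illustrate the simple case, here is a rough sketch that combines the replacement loop with textContent. It assumes the image's title attribute is the text you want (the example above uses alt), and the markup is a shortened stand-in for the question's HTML.

```php
<?php
// Sketch only: replace each <img> with a text node holding its title,
// then read the plain text of <body> via textContent.
$html = '<html><body><p>Text Before <img alt="HAMBURGER" src="emoji-E120.png" title="HAMBURGER"> Text After</p></body></html>';

$doc = new DOMDocument;
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
// Same replacement pattern as in foo() above, keyed on title instead of alt.
foreach ($xpath->query('//img[@title]') as $img) {
    $img->parentNode->replaceChild(
        $doc->createTextNode($img->getAttribute('title')),
        $img
    );
}

// textContent concatenates all descendant text nodes in document order.
$body  = $doc->getElementsByTagName('body')->item(0);
$plain = trim($body->textContent);

echo $plain, "\n";
```

Because the title becomes an ordinary text node at the image's position, it appears between the surrounding text rather than after it.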

VolkerK
  • I don't think he's trying to replace anything, instead it looks like he just wants to get a string of all of the text and image title content in document order. – salathe Feb 13 '12 at 11:41
  • @salathe: Yes, that sounds right (in the question's context). +1 for your answer... – VolkerK Feb 13 '12 at 11:56
-2

You could simply use a regular expression replacement:

<?php
$text = "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">#012<html><body><p>Text Before <img alt=\"HAMBURGER\" height=\"20\" src=\"/sites/all/modules/ckeditor/plugins/apoji/images/emoji-E120.png\" title=\"HAMBURGER\" width=\"20\"> Text After</p></body></html>";
$match = array();
preg_match("/<p[^>]*>(.*(?=<\/p))/i", $text, $match);
echo preg_replace("/<img[^>]*title=\"([^\"]+)\"[^>]*>/i", "$1", $match[1]);
?>
Feysal
  • [No!](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Decent Dabbler Feb 13 '12 at 12:24
  • [Yes, in known and limited cases.](http://stackoverflow.com/a/1733489/1199546) If the structure of the HTML is as simple as in the given example, regular expressions work fine. – Feysal Feb 13 '12 at 12:34
  • To quote OP: "I'm looking to parse some HTML which is submitted from ckeditor". I think it's safe to assume this will not be a *known and limited case*. – Decent Dabbler Feb 13 '12 at 12:57