
aren’t becomes arenât and various other silliness.

Here's the code; I'm working within WordPress to automate the removal of an element from several hundred posts.

function removeImageFromPages() {
    $pages = get_pages(array('exclude' => '802,6,4'));
    foreach ($pages as $page) {
        if ($page->post_content == '') { continue; }
        $doc          = new DOMDocument('1.0', 'UTF-8');
        $post_content = stripslashes($page->post_content);
        @$doc->loadHTML($post_content);
        $content = $doc->saveXML();
        echo($content); exit; // dump the first non-empty page for inspection
    }
}

Originally the post content I'm manipulating was stored in a custom CMS. The initial scrape was done with DOMDoc, without any encoding issues. However, there seems to be some kind of trouble the second time around. All headers on everything are set as UTF-8, but I'm not very experienced with encoding. The first time, it was a pure HTML scrape. Now, I'm dealing with values directly from the database. What am I missing? (And is DOMDoc even the right tool for this job?)

Update - I'm still having the problem, but have new information.

If I print/echo/var_dump the content directly from WordPress ($page->post_content), there is no issue. Once it goes through $doc->saveXML or $doc->saveHTML, the characters become confused. They don't become predictably confused, though.

$doc->loadHTML($page->post_content);
echo($doc->saveXML());

Yields aren’t. However

$doc->loadHTML($page->post_content);
$ps = $doc->getElementsByTagName('p');
echo($ps->item(3)->nodeValue);
echo($doc->saveXML($ps->item(3)));

Yields arenât (in both echoes).

Also, if I copy/paste a string from the document directly into the function, it works perfectly. It's only a problem when dealing with values passed from WordPress.

Altari
  • Check that your database connection, and the table collation, are also utf-8. You have to have a pure UTF-8 pipeline throughout the system. If even a single stage anywhere is some other character set, you're going to get mangled text like this. – Marc B Aug 09 '11 at 15:48
  • Everything associated with the DB that I can see (charset and collation) is utf-8. – Altari Aug 09 '11 at 15:54
  • The connection itself has to be utf-8 as well. Try a `set names 'utf-8'`: http://dev.mysql.com/doc/refman/5.0/en/charset-connection.html – Marc B Aug 09 '11 at 15:58
  • I've learned something new! However, I ran through every variable on that page ('show variables like "[item]";') and everything came back utf-8. I've had this issue before, but always assumed it was an encoding issue with my IDE/FTP program rather than an issue with DOMDocument. This is the first time the data hasn't touched my localhost, though, so I know it's not IDE/FTP. – Altari Aug 09 '11 at 16:05
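
For reference, the checks discussed in the comments above could be run from inside WordPress along these lines (a rough sketch using $wpdb; the thread itself doesn't show this code):

global $wpdb;

// What charset is the MySQL connection actually negotiating?
foreach ($wpdb->get_results("SHOW VARIABLES LIKE 'character_set%'") as $row) {
    echo $row->Variable_name . ' = ' . $row->Value . "\n";
}

// Force the connection to UTF-8 if anything above disagrees
// (MySQL spells the charset 'utf8', without the hyphen)...
$wpdb->query("SET NAMES 'utf8'");
// ...or set DB_CHARSET to 'utf8' in wp-config.php and let WordPress do it.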

1 Answer


Going through the comments on the PHP documentation page for DOMDocument::loadHTML, it appears that loadHTML does not respect the encoding you might have set on the DOMDocument.

Instead, it will read the encoding from the meta tag in the HTML. With the original scraping, I presume you were dealing with complete pages, including meta tags.

The post_content of a WordPress page, however, is (as far as I know) only a document fragment, not a complete HTML page (or did you change that?). So now it can't figure out the encoding from the content, defaults to ISO-8859-1, and screws everything up. Not to mention it adds a doctype and html and body tags, etc., around the fragment.
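
Something like the following should show both effects (a quick sketch, assuming the script itself is saved as UTF-8):

$doc = new DOMDocument();
@$doc->loadHTML('<p>aren’t</p>');   // a bare fragment, no meta tag in sight

// libxml has fallen back to ISO-8859-1, so the ’ comes out mangled, and the
// fragment has been wrapped in a doctype plus html and body tags:
echo $doc->saveHTML();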

I'm not entirely sure DOMDocument is the right tool here, but I don't know what the alternatives would be in your case (apart from regular expressions, obviously).

What you can probably do, though, is wrap a simple HTML structure around the post content, including a meta tag to make sure it's UTF-8, before you pass it to loadHTML() and then use XPath to save just the body of it.
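
For example, something along these lines (an untested sketch; the helper name and the per-node saveXML() calls are just one way of doing it):

function load_fragment_as_utf8($fragment) {
    // Wrap the fragment so loadHTML() sees an explicit UTF-8 declaration
    // instead of falling back to ISO-8859-1.
    $html = '<html><head>'
          . '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
          . '</head><body>' . $fragment . '</body></html>';

    $doc = new DOMDocument('1.0', 'UTF-8');
    @$doc->loadHTML($html);

    // ... remove the unwanted image element(s) from $doc here ...

    // Serialize only the children of <body>, dropping the doctype and the
    // html/head/body wrapper again.
    $xpath  = new DOMXPath($doc);
    $output = '';
    foreach ($xpath->query('//body/node()') as $node) {
        $output .= $doc->saveXML($node);
    }
    return $output;
}

Saving node by node with saveXML() is what keeps the doctype and the wrapper tags out of the result; if you stick with a plain saveHTML() call, you'd have to strip those back out afterwards.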

mercator
  • Thanks! I checked the encoding going in and coming out, but didn't think about how it was handling it in process. Ultimately I used regex ("so now I have 2 problems" as I've been told), but will remember to surround it in proper HTML the next time I need to do mass editing. – Altari Aug 09 '11 at 21:11