aren't
becomes aren’t
and various other silliness.
Here's the code. It runs inside WordPress to automate removing an element from several hundred posts.
function removeImageFromPages() {
    $pages = get_pages(array('exclude' => '802,6,4'));
    foreach ($pages as $page) {
        if ($page->post_content == '') { continue; }
        $doc = new DOMDocument('1.0', 'UTF-8');
        $post_content = stripslashes($page->post_content);
        @$doc->loadHTML($post_content);
        $content = $doc->saveXML();
        echo($content); exit;
    }
}
The post content I'm manipulating was originally stored in a custom CMS. The initial scrape was done with DOMDocument and had no encoding issues, but something goes wrong the second time around. All headers are set to UTF-8, but I'm not very experienced with encoding. The first time, it was a pure HTML scrape; now I'm dealing with values taken directly from the database. What am I missing? (And is DOMDocument even the right tool for this job?)
Update - I'm still having the problem, but have new information.
If I print/echo/var_dump the content directly from WordPress ($page->post_content), there is no issue. Once it goes through $doc->saveXML() or $doc->saveHTML(), the characters get garbled, and not in a consistent way.
$doc->loadHTML($page->post_content);
echo($doc->saveXML());
Yields aren’t. However,
$doc->loadHTML($page->post_content);
$ps = $doc->getElementsByTagName('p');
echo($ps->item(3)->nodeValue);
echo($doc->saveXML($ps->item(3)));
Yields arenât
(in both echo calls).
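One way to read the two outputs above, offered as an assumption rather than a confirmed diagnosis: both look like the UTF-8 bytes of a curly apostrophe (U+2019) being decoded as a single-byte encoding. The sketch below reproduces that effect with the mbstring extension; `$original` is a made-up sample string, not data from the site.

```php
<?php
// Sketch only: shows how UTF-8 bytes look when misread as a single-byte encoding.
// Requires the mbstring extension. $original is a hypothetical sample string.
$original = "aren\u{2019}t"; // U+2019 is stored as the bytes E2 80 99 in UTF-8

// Treat the UTF-8 bytes as Windows-1252 and re-encode the result as UTF-8.
$asCp1252 = mb_convert_encoding($original, 'UTF-8', 'Windows-1252');

// Treat the same bytes as ISO-8859-1 instead.
$asLatin1 = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// Windows-1252 maps all three bytes to visible characters.
echo $asCp1252, PHP_EOL;

// ISO-8859-1 maps the last two bytes to invisible C1 control characters,
// so a hex dump is easier to inspect than the raw string.
echo bin2hex($asLatin1), PHP_EOL;
```

If that reading is right, the two results are the same misdecoding and differ only in whether the second and third bytes end up as visible characters.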
Also, if I copy/paste a string from the document directly into the function, it works perfectly. The problem only appears with values passed from WordPress.
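A quick check that might narrow this down, assuming the cause is that loadHTML() receives no charset declaration and therefore does not decode the input as UTF-8. The sketch prepends a charset hint before parsing; `$html` stands in for `$page->post_content` and `$hint` is a name introduced here for illustration.

```php
<?php
// Sketch only: assumes the garbling comes from loadHTML() not knowing the input is UTF-8.
// $html is a hypothetical stand-in for $page->post_content.
$html = "<p>aren\u{2019}t</p>";

$doc = new DOMDocument('1.0', 'UTF-8');

// Prepend a charset declaration so the HTML parser decodes the bytes as UTF-8.
$hint = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
$doc->loadHTML($hint . $html);

// Read the paragraph text back and dump its bytes for comparison with the input.
$text = $doc->getElementsByTagName('p')->item(0)->nodeValue;
echo bin2hex($text), PHP_EOL;
```

If the bytes read back match the original text, the missing declaration is a likely culprit; if they do not, the cause is probably elsewhere, for example in how the content was stored in the database. Note that the saved document will then contain the extra meta element, which would need to be stripped before writing the content back.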