
I have been attempting to use SimpleXML to parse XHTML which has been captured through output buffering. To get there I used a DOM object as per some excellent suggestions given in another question.

The SimpleXML object is passed to other classes that use it to inspect and/or change the page (currently they ignore it so I can get this working), and the result of $xml->asXML() is output once the manipulation is done. It’s not simple, but it is a reasonably elegant way of working with some legacy code that I would like to replace sometime.

So far so good, but the output now has odd characters that look for all the world like the result of a text encoding issue.

$doc = new DOMDocument();
$doc->loadHTML($this->PAGE); // <-- in goes nicely behaved XHTML
$doc->encoding = 'utf-8'; // This did not seem to help
$xmlObject = simplexml_import_dom($doc,$customClass);
//[...stuff...]
$this->PAGE = $xmlObject->asXML(); // --> out comes XHTML with cruft
//[...logging and so forth...]
echo $this->PAGE;

According to the meta tag, the HTML is assuming iso-8859-1.

I’m seeing plenty of  if that is meaningful to anyone?

I tried to use the DOMDocument to convert to utf-8, which I guess was happening anyway because it made no difference. On top of that, the meta data in the XHTML is now wrong.

Is there a way of detecting the encoding in use up to that point, switching encodings, and then switching back, perhaps by tokenizing the troublesome characters or some such? Failing that, is there another approach that would minimise the resultant mess?
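To make the "switch and switch back" idea concrete, here is a minimal sketch of just the string conversion, assuming the mbstring extension is available. The helper names toUtf8() and fromUtf8() and the $sourceEncoding value are hypothetical, not part of my real code. It does not address how DOMDocument::loadHTML() reads the charset declared in the markup, which would still need handling separately.

```php
<?php
// Hypothetical helpers: convert a page to UTF-8 before the XML work
// and back to its original encoding afterwards. Requires mbstring.

function toUtf8(string $html, string $sourceEncoding)
{
    // Nothing to do when the page is already UTF-8.
    if (strcasecmp($sourceEncoding, 'UTF-8') === 0) {
        return $html;
    }
    return mb_convert_encoding($html, 'UTF-8', $sourceEncoding);
}

function fromUtf8(string $html, string $targetEncoding)
{
    if (strcasecmp($targetEncoding, 'UTF-8') === 0) {
        return $html;
    }
    // Characters with no equivalent in the target encoding are not preserved.
    return mb_convert_encoding($html, $targetEncoding, 'UTF-8');
}

$sourceEncoding = 'ISO-8859-1';          // assumed; would come from the page settings
$page = "<p>caf\xE9</p>";                // "é" as a single ISO-8859-1 byte
$utf8 = toUtf8($page, $sourceEncoding);  // "é" is now the two bytes C3 A9
// ... DOM / SimpleXML manipulation would happen here ...
$out = fromUtf8($utf8, $sourceEncoding); // back to the original byte
```

This only works if the source encoding is known and is one that mbstring supports.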

I need to get this right because, when finished, it will be shared with a bunch of different people (in theory a really big number) with different setups that I am worried about breaking.

UPDATE: Nuts! It seems the received wisdom is that SimpleXML will only deal with utf-8. So, as a slight change to the question: how can I HTML-encode anything that would suffer from the change (for the many different languages used) without also encoding the XHTML structure? Or have I hit a dead end?
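One possible shape for this, as a sketch only: assuming the mbstring extension is available and the string coming out of asXML() is utf-8, mb_encode_numericentity() can turn every non-ASCII character into a numeric entity while leaving the ASCII markup alone. The function name encodeNonAscii() is hypothetical.

```php
<?php
// Hypothetical helper: replace every non-ASCII character in a UTF-8
// string with a decimal numeric entity. ASCII characters, including
// the <, > and & that make up the markup, are left untouched.

function encodeNonAscii(string $utf8Markup)
{
    // convmap: [first code point, last code point, offset, mask]
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($utf8Markup, $convmap, 'UTF-8');
}

$markup = "<p class=\"note\">caf\xC3\xA9 &amp; cr\xC3\xA8me</p>";
echo encodeNonAscii($markup), "\n";
// prints: <p class="note">caf&#233; &amp; cr&#232;me</p>
```

The result is pure ASCII, so the bytes read the same under iso-8859-1, utf-8 and other ASCII-compatible encodings, whatever the meta tag declares.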

  • It does not help much, but I have found out that the character encoding is set in the user's language file settings, so I have a constant I can use to get the "correct" encoding, if only I knew how to stop things being mangled by the utf-8 conversion which seems to be part of SimpleXML. – Matthew Brown aka Lord Matt Jun 08 '14 at 16:56
  • Also, utf8_encode will not help here, as the user could be using an encoding that is not supported or might already be using utf-8... – Matthew Brown aka Lord Matt Jun 08 '14 at 17:19
