2

I'm trying to parse an XML file, but when loading it SimpleXML prints the following warning:

Warning: simplexml_load_file() [function.simplexml-load-file]: gpr_545.xml:55: parser error : Entity 'Oslash' not defined in import.php on line 35

This is that line:

<forenames>B&Oslash;IE</forenames><x> </x>

As it is a warning, I might ignore it, but I'd like to understand what is happening.

Maarten

5 Answers

3

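HTML entities like &Oslash; are not the same as XML entities: XML predefines only &amp;, &lt;, &gt;, &quot; and &apos;, so any other name has to be declared in a DTD or the parser reports it as undefined. Here's a table for replacing HTML entities with XML entities.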

As far as I can tell from one of your comments on another post, you're having trouble with the entity &sol;. I don't know if this is even a valid HTML entity; my Firefox won't show the character and only outputs the entity name. But I found another table listing most entities and their character reference numbers. Try adding them to your replacement table and you should be safe. &sol;'s reference number is &#47; (the slash character, /), by the way.
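Instead of maintaining the replacement table by hand, the lookup can be delegated to PHP's built-in entity table. The following is only a sketch: the function name `html_entities_to_numeric` is made up for illustration, and it assumes PHP 7.4 or later with the iconv extension available. It rewrites every named entity except XML's own five as a numeric character reference, which should not depend on the file's encoding.

```php
<?php
// Hypothetical helper (not from the answer): rewrite HTML named entities as
// numeric character references, which an XML parser accepts without a DTD.
function html_entities_to_numeric(string $xml): string
{
    // The five entities XML predefines must stay untouched.
    $xmlPredefined = ['amp', 'lt', 'gt', 'quot', 'apos'];

    return preg_replace_callback(
        '/&([A-Za-z][A-Za-z0-9]*);/',
        function (array $m) use ($xmlPredefined): string {
            if (in_array($m[1], $xmlPredefined, true)) {
                return $m[0];
            }
            // ENT_HTML5 selects PHP's largest built-in entity table.
            $decoded = html_entity_decode($m[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
            if ($decoded === $m[0]) {
                return $m[0]; // unknown name: leave it for the parser to report
            }
            // Some HTML5 entities expand to more than one code point.
            $codepoints = unpack('N*', iconv('UTF-8', 'UCS-4BE', $decoded));
            return implode('', array_map(fn ($cp) => "&#$cp;", $codepoints));
        },
        $xml
    );
}

// The line from the question, wrapped in a root element so it parses alone.
$fixed = html_entities_to_numeric('<forenames>B&Oslash;IE</forenames><x> </x>');
$xml = simplexml_load_string("<row>$fixed</row>");
echo $xml->forenames, "\n"; // BØIE (as UTF-8)
```

Names that PHP's table does not know are left as they are, so the parser still warns about them rather than silently losing data.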

Björn
2

I think this is an encoding problem. PHP, or SimpleXML in this particular case, does not like the Danish Ø you've got in that forenames tag. You could try converting the whole file to UTF-8 and replacing the escaped version in the tag with the real character while you're at it. Afterwards you can read a file that is completely free of escaped characters into SimpleXML.

KB22
  • not sure what you mean. This xml file is encoded as ISO-8859-1 (). – Maarten Sep 15 '09 at 12:32
  • Right: use utf-8 instead of iso-8859-1 – Jeremy L Sep 15 '09 at 12:50
  • yepp, and make use of utf8_encode() for the actual encoding of the text. – KB22 Sep 15 '09 at 12:58
  • that'd make sense if I were the author, but I'm on the parsing end so to say ;-) – Maarten Sep 15 '09 at 13:00
  • You got the file, so you can read it line by line and encode it, can't you? I happened to write an XML filter application once for a Japanese customer. And believe me, doing this extra step before the actual parsing paid off... ;) – KB22 Sep 15 '09 at 15:29
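The re-encoding step discussed in this thread could look roughly like the sketch below. It is an illustration under assumptions, not KB22's actual filter: the function name `latin1_xml_to_utf8` is hypothetical, the input is assumed to really be ISO-8859-1, and PHP with the iconv extension is assumed. It converts the bytes to UTF-8, updates the encoding declaration, and replaces named HTML entities with the characters they stand for.

```php
<?php
// Hypothetical helper: turn an ISO-8859-1 XML string that contains HTML
// entities into a UTF-8 string SimpleXML can load.
function latin1_xml_to_utf8(string $raw): string
{
    // Assumption: the input really is ISO-8859-1, as stated in the comments.
    $utf8 = iconv('ISO-8859-1', 'UTF-8', $raw);

    // Keep the XML declaration in sync with the new encoding.
    $utf8 = preg_replace('/(<\?xml[^>]*encoding=["\'])[^"\']+/i', '${1}UTF-8', $utf8, 1);

    // Replace named entities (except XML's own five) with real characters.
    return preg_replace_callback(
        '/&([A-Za-z][A-Za-z0-9]*);/',
        function (array $m): string {
            if (in_array($m[1], ['amp', 'lt', 'gt', 'quot', 'apos'], true)) {
                return $m[0];
            }
            $char = html_entity_decode($m[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
            // Never emit a raw "<" or "&": that would corrupt the markup.
            return ($char === '<' || $char === '&') ? $m[0] : $char;
        },
        $utf8
    );
}

// Usage with the file from the question (path taken from the warning):
// $xml = simplexml_load_string(latin1_xml_to_utf8(file_get_contents('gpr_545.xml')));
```

Converting the encoding alone does not make the warning go away; it is the entity replacement in the last step that removes the undefined `&Oslash;`.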
2

The HTML encoding of a Latin-1 character (Ø, which is what that entity stands for) is what has broken the XML parser. If you're in control of the data, you need to escape it using an XML-style numeric character reference (Ø just happens to be &#216;).

squeeks
  • Yes, unforgiving XML parsers break when they are expecting XML-style encoding of non-ASCII characters and are given HTML-style encoding instead. – squeeks Sep 15 '09 at 12:41
  • ok. So I'm just parsing this. I looked at the table from Björn's answer, and it works for my first example, but the next problem is this entity which is not in that table: &sol; . Is there a more stable solution? – Maarten Sep 15 '09 at 12:48
  • XSLT transforming the document before you pass it off to an XML parser would be one solution. – squeeks Sep 15 '09 at 12:54
1

Just had a very similar problem and solved it in the following way. The main idea was to load the file into a string, replace every "&" so that bad entities turn into something like "[[entity]]Oslash;", and carry out the reverse replacement before displaying an XML node.

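// Load the file with every "&" masked so SimpleXML never sees an entity.
function readXML($filename){
    $xml_string = file_get_contents($filename);
    $xml_string = str_replace("&", "[[entity]]", $xml_string);
    return simplexml_load_string($xml_string);
}
// Restore the masked "&" in a node's text and convert it to the page encoding.
function xml2str($xml){
    $str = str_replace("[[entity]]", "&", (string)$xml);
    $str = iconv("UTF-8", "WINDOWS-1251", $str);
    return $str;
}
$xml = readXML($filename); // $filename: path to the XML file
echo xml2str($xml->forenames);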

The iconv("UTF-8", "WINDOWS-1251", $str) call is there because my page uses the WINDOWS-1251 encoding.

Krivoi
0

Try to use this line:

<forenames><![CDATA[B&Oslash;IE]]></forenames><x> </x>

and read this about CDATA
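To illustrate what the CDATA wrapper changes, here is a small sketch; the helper name `forename_from_cdata` and the `<row>` wrapper element are made up for the example. Inside CDATA the parser treats `&Oslash;` as plain text, so it loads without a warning, but the entity is not resolved either and has to be decoded separately.

```php
<?php
// Hypothetical helper: load a fragment and return [raw text, decoded text].
function forename_from_cdata(string $fragment): array
{
    // Wrap in a root element so the fragment is a complete document.
    $xml = simplexml_load_string("<row>$fragment</row>");
    // The string cast returns the CDATA content verbatim.
    $raw = (string) $xml->forenames;
    // Resolve the HTML entity ourselves, since the XML parser did not.
    $decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    return [$raw, $decoded];
}

[$raw, $decoded] = forename_from_cdata(
    '<forenames><![CDATA[B&Oslash;IE]]></forenames><x> </x>'
);
echo $raw, "\n";     // B&Oslash;IE
echo $decoded, "\n"; // BØIE (as UTF-8)
```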

lg.
  • Before parsing you should insert CDATA tag for every entity with "strange" characters. – lg. Sep 15 '09 at 12:50
  • if it's got this error in it, then it's not valid xml to begin with. up to you to tell the original authors to fix it or do this sort of check prior to parsing and wrap the invalid chunks – Jeremy L Sep 15 '09 at 12:51