7

SEE EDITS AT BOTTOM TO SHOW MORE ACCURATE ERROR OUTPUT

I'm parsing fairly large (~15MB) XML files with PHP for the first time using SimpleXML. The files are flight search results, so they have long attributes (links back to Kayak). Example:
"/book/flightcode=1238917408.NxJI6G.0.F.ORBITZAIR,ORBITZAIR.0.f36f1ea92513977249aa695112410052&sid=26-Vu01v7ilzhSAjPVLZ3Ul"

SimpleXML throws this error when parsing:

"Entity: line 10: parser error : EntityRef: expecting ';' in" and then;

"38917408.NxJI6G.0.F.ORBITZAIR,ORBITZAIR.0.f36f1ea92513977249aa695112410052&sid in" and then;

"simplexml_load_string() [function.simplexml-load-string]: ^ in,"

and so forth for each line where there are these urls.

I found a mention of SimpleXML not liking long attributes on php.net with no solution. I would rather just use and learn SimpleXML for now and work past this error if there is a non-janky, somewhat easy workaround.

Does anyone have a solution? Thanks in advance!

I tried pasting the first 13 lines of the XML here, but only the text content shows up, without the XML markup. I can post it another way if that will help. I'm not sure whether using another parser/extension would reduce the functionality or ease of use, but please feel free to suggest one if there's no workaround (I'm thinking of DOM or XMLReader).

EDITS BELOW TO INCLUDE LESS ADULTERATED ERROR OUTPUT:

http://dl.dropbox.com/u/10206237/stack_overflow_xml.xml

ERROR 1:

simplexml_load_string() [<a href='function.simplexml-load-string'>function.simplexml-load-string</a>]: Entity: line 10: parser error : EntityRef: expecting ';' in 

ERROR 2: (I think the XML is fine because it works with a Python script using DOM; I'm translating it to PHP because I don't know Python. I didn't know that the output in the browser would be different. Thanks for being patient.)

<a href='function.simplexml-load-string'>function.simplexml-load-string</a>]: 38917408.Pt8rW8.0.F.ORBITZAIR,ORBITZAIR.0.f36f1ea92513977249aa695112410052&amp;_sid_ in 

ERROR 3:

function.simplexml-load-string</a>]:                                                                                ^ in     

(all of those spaces are in there)

hakre
JohnAllen
  • 7
    It's not the "long" attribute, it's the '`&`' in the attribute. It's not a proper XML entity. All literal ampersands need to be encoded (ironically) as `&amp;`. The error states it's expecting ';' because it wants '`&sid`' to be an entity, i.e. '`&sid;`'. – Darryl E. Clarke Dec 27 '10 at 16:24
  • 1
    The solution is to ask whoever generated that XML to fix their code and output some valid XML plzkthx. – Josh Davis Dec 27 '10 at 18:20
  • In the actual file it says: " &_sid_=15- The error was output by my browser. I clearly know nothing about encoding. – JohnAllen Dec 27 '10 at 18:41
  • 1
    That last comment pretty much invalidates everything that's been posted. Never ever look at XML content inside of a browser please. Post a link to the XML file as well as an **unadulterated** sample of the error messages. If taken from a browser, use "View source" to avoid what you've just described. – Josh Davis Dec 27 '10 at 18:44
  • I edited the OP to include better information. Thanks again for the help and patience! – JohnAllen Dec 27 '10 at 20:01
  • The only error I get from the XML file you've linked to is that it's been cut off, the `` and `` tags are not closed. No malformed entities in sight. – Josh Davis Dec 27 '10 at 21:12

4 Answers

12

As mentioned in other answers and comments, your source XML is broken, and XML parsers are supposed to reject invalid input. libxml has a "recover" mode which would let you load this broken XML, but you would lose the "&sid" part, so it wouldn't help.

If you're lucky and you like taking chances, you can try to make it work by kind-of-fixing the input. You can use a string replacement to escape the ampersands that look like they're in the query part of a URL.

$xml = file_get_contents('broken.xml');
// replace '&' followed by a bunch of letters, numbers
// and underscores and an equal sign with &amp;
$xml = preg_replace('#&(?=[a-z_0-9]+=)#', '&amp;', $xml);
$sxe = simplexml_load_string($xml);

This is, of course, nothing but a hack and the only good way to fix your situation is to ask your XML provider to fix their generator. Because if it generates broken XML, who knows what other errors slip by unnoticed?
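
A slightly stricter variant of the replacement above, as a minimal sketch only: instead of looking for an equals sign, it escapes every `&` that does not already start something shaped like an entity or character reference. The helper name `escape_bare_ampersands` and the sample string are made up for illustration, and this remains a hack that papers over broken input rather than fixing it.

```php
<?php
// Hypothetical helper: escape ampersands that do not begin something shaped
// like an entity reference (&name;) or a character reference (&#123; / &#x1F;).
function escape_bare_ampersands(string $xml): string
{
    return preg_replace(
        '/&(?!(?:[A-Za-z_][A-Za-z0-9_.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)/',
        '&amp;',
        $xml
    );
}

// Made-up sample in the spirit of the question's data.
$broken = '<r><leg url="/book/flightcode=1&sid=26-Vu" note="a &amp; b"/></r>';
$fixed  = escape_bare_ampersands($broken);
$sxe    = simplexml_load_string($fixed);

// The parser decodes &amp; back to a literal ampersand in the attribute value.
echo $sxe->leg['url'], "\n";
```

Because already-escaped sequences such as `&amp;` are left alone, running the helper twice should not double-encode anything.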

Randell
Josh Davis
  • How to examine if the parsed XML input is invalid? SimpleXmlElement() function doesn't return false in case of invalid XML? Does it? – scaryguy Sep 06 '12 at 08:01
  • If the XML is invalid then you're kind of screwed really. You can try to salvage data using string manipulation (as opposed to XML manipulation) but the only sure way to fix the situation is to produce valid XML. – Josh Davis Sep 06 '12 at 17:14
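
On the question of detecting a failed parse: `simplexml_load_string()` returns `false` when the input is not well-formed, while `new SimpleXMLElement()` throws an exception instead. A minimal sketch of collecting the libxml errors rather than letting them surface as PHP warnings, using a made-up input string:

```php
<?php
// Collect libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

// Made-up input with a bare ampersand in an attribute value.
$sxe = simplexml_load_string('<r><leg url="/book?a=1&sid=2"/></r>');

if ($sxe === false) {
    foreach (libxml_get_errors() as $error) {
        // Each entry is a LibXMLError carrying a message, line and column.
        printf("line %d, column %d: %s\n", $error->line, $error->column, trim($error->message));
    }
    libxml_clear_errors();
}

// Restore whatever error-handling mode was active before.
libxml_use_internal_errors($previous);
```

Checking with `=== false` matters here, because a successfully parsed but empty element can also evaluate as falsy in a loose comparison.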
3

Darryl has the right answer as to why this is happening in his comment above. One way of fixing this would be to use str_replace() to replace every '&' with '&amp;' in the XML, although that would also double-encode any ampersands that are already escaped ('&amp;' would become '&amp;amp;'). According to the PHP manual you could also use this regular expression, which only replaces ampersands that are not already part of an entity:

$s = preg_replace('/&[^; ]{0,6}.?/e', "((substr('\\0',-1) == ';') ? '\\0' : '&amp;'.substr('\\0',1))", $s);
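
Note that the `/e` modifier was deprecated in PHP 5.5 and removed in PHP 7.0, so on current PHP the same idea has to go through `preg_replace_callback()`. A minimal sketch that keeps the same pattern, with a made-up sample string:

```php
<?php
// Made-up sample: one bare ampersand and one already-escaped ampersand.
$s = 'flightcode=1&sid=26 and a &amp; b';

// Same pattern as above, minus the removed /e modifier. A match that ends in
// ';' is treated as an existing entity and kept; otherwise its leading '&'
// is rewritten as '&amp;'.
$s = preg_replace_callback(
    '/&[^; ]{0,6}.?/',
    function (array $m): string {
        $match = $m[0];
        return substr($match, -1) === ';' ? $match : '&amp;' . substr($match, 1);
    },
    $s
);

echo $s, "\n";
```

The pattern is a heuristic: it only looks a few characters past each `&`, so it can misjudge text where a semicolon happens to follow shortly after a bare ampersand.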
Jeremy
0

The XML file may simply be too big for the parser's default limits. You can try passing LIBXML_PARSEHUGE as an option, which helped in my case.

Markus Zeller
0

I had this problem with 13MB files and solved it by passing the LIBXML_PARSEHUGE option:

$xml = new SimpleXMLElement($contents, LIBXML_PARSEHUGE);

NOTE: using ini_set at 1GB didn't solve my problem because the PARSED contents occupied more than this.

A more radical approach is using other libraries to STREAM rather than LOAD WHOLE FILE (SAX parser versus DOM parser), like XML Streamer
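
As a rough illustration of the streaming idea, here is a minimal sketch using PHP's bundled XMLReader rather than the XML Streamer library mentioned above. The element name `leg`, the `price` attribute and the sample document are all made up; a streaming parser still rejects malformed XML, so this helps with memory use, not with bare ampersands.

```php
<?php
// Made-up sample written to a temp file so the sketch is self-contained.
$path = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents($path, '<results><leg price="100"/><leg price="250"/></results>');

$reader = new XMLReader();
$reader->open($path);

$prices = [];
// read() advances one node at a time, so the whole tree is never held in memory.
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'leg') {
        $prices[] = (int) $reader->getAttribute('price');
    }
}

$reader->close();
unlink($path);

print_r($prices);
```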

tony gil