
I'm currently having a problem importing a large XML file and I can't work out why. We get an XML output from a partner that is around 443MB in size. The error that I get is as follows:

PHP Warning:  SimpleXMLElement::__construct(): Entity: line 1: parser error : internal error in /home/imports/catalog.php on line 54

Warning: SimpleXMLElement::__construct(): Entity: line 1: parser error : internal error in /home/imports/catalog.php on line 54
PHP Warning:  SimpleXMLElement::__construct(): ch to marriage, parenting, entrepreneurship, etc will be significantly upgraded. in /home/imports/catalog.php on line 54

Warning: SimpleXMLElement::__construct(): ch to marriage, parenting, entrepreneurship, etc will be significantly upgraded. in /home/imports/catalog.php on line 54
PHP Warning:  SimpleXMLElement::__construct():
 ^ in /home/imports/catalog.php on line 54

Warning: SimpleXMLElement::__construct():
 ^ in /home/imports/catalog.php on line 54
PHP Fatal error:  Uncaught exception 'Exception' with message 'String could not be parsed as XML' in /home/imports/catalog.php:54
Stack trace:
#0 /home/imports/catalog.php(54): SimpleXMLElement->__construct('<?xml version="...')
#1 {main}
  thrown in /home/imports/catalog.php on line 54

Fatal error: Uncaught exception 'Exception' with message 'String could not be parsed as XML' in /home/imports/catalog.php:54
Stack trace:
#0 /home/imports/catalog.php(54): SimpleXMLElement->__construct('<?xml version="...')
#1 {main}
  thrown in /home/imports/catalog.php on line 54

Line 54 of the code is simply:

$xml = new SimpleXMLElement(file_get_contents($_CFG_XML_URL));

As far as I can tell, the error appears to be in the element containing "ch to marriage, parenting, entrepreneurship, etc will be significantly upgraded.". Unfortunately this is a long way into the file, and due to its size it's difficult to read the contents. My large-file reader reads a line at a time, and this XML is all on one line, so it's too much for it to handle gracefully, even on a workstation with 32GB of RAM and a 64-bit editor.

I've tried redownloading the file a few times but the problem is always the same. I've doubled the available memory for the script and it still fails in the same place.
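To get more detail than the truncated warnings above, libxml's structured error collection can be switched on before parsing; each error carries a line and column offset, which is particularly useful when the whole document is on one line. A minimal, self-contained sketch, using a deliberately broken stand-in string rather than the real feed:

```php
<?php
// Collect libxml errors instead of letting PHP emit warnings.
libxml_use_internal_errors(true);

// Stand-in for file_get_contents($_CFG_XML_URL); the EBook element
// is deliberately left unclosed so parsing fails.
$payload = '<?xml version="1.0"?><Catalog><EBook EAN="x"><Title>Demo</Title></Catalog>';

$xml = simplexml_load_string($payload);

if ($xml === false) {
    foreach (libxml_get_errors() as $err) {
        // column is the offset within the line, which pinpoints the
        // failure position in a single-line document.
        printf("libxml [%d] line %d col %d: %s",
            $err->code, $err->line, $err->column, $err->message);
    }
    libxml_clear_errors();
}
```

Run against the real 443MB string, this would report the libxml error code and the exact column instead of the truncated messages in the log above.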

So, I got on to the partner and asked for the XML for this particular item and they provided the following:

<EBook EAN="9792219192201">
    <Title>Success-a-Phobia</Title>
    <SubTitle>Discovering And Conquering Mankinds Most Persuasive, but Unknown, Phobia</SubTitle>
    <Publisher>The Benjamin Consulting Group, LLC</Publisher>
    <PublicationDate>29/09/2012</PublicationDate>
    <Contributors>
        <Contributor Code="A01" Text="By (author)">Benjamin, Marcus D.</Contributor>
    </Contributors>
    <Formats>
        <Format Type="6"/>
    </Formats>
    <ShortDescription>People today still desire to be successful in matters of family, finance or business even though we are in the midst of major social, political and economic challenges. Have you every been at that moment where you wanted to do something significant, yet you were paralyzed from making the necessary choices to realize your dream? Have you experienced failure and are now sitting in the stands, paralyzed from getting back in the &amp;quote;game of life?&amp;quote;  Are you at the verge of a major decision that could affect your life for many years? If you are in this category, this is your book of the year!    With humor, real-life antidotes, real-life examples and solid narration, Marcus Benjamin will guide you toward discovering the most pervasive, yet unknown, phobia in the history of mankind.  Once this phobia is discovered, the second half of the book shows you how to rid yourself of this phobia for good. Not only will this book impact your life, but your approach to marriage, parenting, entrepreneurship, etc will be significantly upgraded.</ShortDescription>
</EBook>

Nothing about that XML rings any alarm bells for me, but clearly partway through it PHP is having a problem. The failure appears to be 978 characters into the element content, but that number doesn't mean anything obvious to me either.

The PHP script is running from the command line on an Amazon EC2 instance. The OS is Amazon Linux (RHEL-based).

So, basically, I'm stuck. Has anyone any ideas what could be causing this problem?

hakre
Engineer81
  • The xml you provided works fine for me. http://codepad.viper-7.com/suwsVY – Jonathan Kuhn Dec 20 '12 at 19:04
  • The only thing I can think of is to open the file with a text editor and take out a large chunk and try to re-load it. Similar to a binary search, remove the last half. If you get no errors, remove the first half and try the second half. Work your way down until you find a problem. If both halves work fine, it might just be the file size, but I doubt it because I don't see any memory allocation errors. If either half throws an error, cut that in half again and again until you find the problem. Most likely it is an invalid xml node. – Jonathan Kuhn Dec 20 '12 at 19:09
  • @JonathanKuhn - That is an excellent idea. It's not too easy to work with this file due to its size but I'll give it a go. Certainly the easiest option to begin, I guess – Engineer81 Dec 20 '12 at 19:16
  • Also, there is a simplexml_load_file() function that returns a SimpleXML instance. I doubt it is the issue, as that function should just do the same thing, but it might help. – Jonathan Kuhn Dec 20 '12 at 19:19
  • My experience with large XML files is that you are better off using xml_read for files over ~50MB. It is way faster and uses very little memory, but it is less flexible to work with. – Green Black Dec 20 '12 at 19:20
  • @JonathanKuhn - Your suggestion made me wonder... I was able to load just the last couple of MB of the file into my editor so that it was more manageable, and I found that this is the very last element. So I've just removed it and I'm reuploading the file for processing. – Engineer81 Dec 20 '12 at 19:26
  • @John - We don't usually have problems with XML files of this size. We do have some that are over 100GB that we import into MySQL using XML2DB and then process the data from there instead, but we've never needed that for a file so "small". – Engineer81 Dec 20 '12 at 19:27
  • @JonathanKuhn - I removed that last element, and then the one that became the last one caused a similar error. I've now removed all other elements so that only the original problem element remains, and it seems to run through with no problem (this is only one phase of a larger process; it usually fails in seconds, but this time it has been running for a while). So the problem appears to be related to the size rather than the actual content. Need to see what values I can change to resolve it. Maybe an environment issue. Balls! – Engineer81 Dec 20 '12 at 19:52
  • Perhaps then a larger cut needs to be taken out. Like I said before, I don't think it is size but poorly formed xml simply because there are no memory errors. It is possibly a missing closing xml tag or some node that should have a cdata tag around it. I would suggest something like halving the xml nodes and checking the file after that. If you have ssh access, you could use something like head to get the first `-n NNN` number of lines and pipe that to a file. then clean the file up, removing the last partial element and adding the final closing xml tag(s). maybe find an xml validation tool. – Jonathan Kuhn Dec 20 '12 at 19:59
  • http://stackoverflow.com/questions/7528249/tools-to-validate-large-xml-100mb-file – Jonathan Kuhn Dec 20 '12 at 20:01
  • It says the error is in line 1, is the original XML without line breaks? Also, you should probably use [simplexml_load_file](http://php.net/manual/en/function.simplexml-load-file.php) instead of file_get_contents. – dualed Dec 20 '12 at 20:38
  • @dualed - Yes, it is all on one line, which is why I have problems with my large-file editor, which likes to read a line at a time. I've switched it to simplexml_load_file() to see if it makes any difference, purely because it's quick and easy to do and I can leave it running to see how it performs. If it fails again I will attempt to chop it up into smaller pieces as Jonathan suggested. – Engineer81 Dec 20 '12 at 20:50
  • Interesting... @JonathanKuhn - You thought simplexml_load_file() wouldn't make any difference, yet I set it going an hour ago and I've just come to check and it appears to be running through. While this is an excellent result, I'd still like to know why the other method doesn't work, as it has worked on another platform of ours in the past, so there's obviously something at fault. Either way, I'm a happy bunny right now! – Engineer81 Dec 20 '12 at 21:41
  • And did the script finish successfully? – dualed Dec 20 '12 at 22:29
  • It's still running. There's a lot of processing to do, including loading in images from the provider servers. It does appear that this phase completed OK though, yes. – Engineer81 Dec 20 '12 at 22:30
  • I cannot say for certain what did the trick. But since file_get_contents reads the whole file into memory and *then* you pass it on to SimpleXML, even if PHP would collect the string eventually, it can only do so after SimpleXML has finished parsing it. So with simplexml_load_file you need at least a few hundred MB less memory. – dualed Dec 20 '12 at 22:39
  • PHP is configured on this machine to be able to use up to 6GB so I don't think it's a memory issue. I wonder if it's a limitation of the Amazon EC2 instance that we are using, somehow? But then, we're using an m1.large so I'm sure that's not it. – Engineer81 Dec 21 '12 at 14:46
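For what it's worth, the streaming reader mentioned in the comments (presumably PHP's XMLReader) can walk a huge single-line document node by node without building the whole tree. A minimal sketch using an inline stand-in for the feed, with element names taken from the sample in the question:

```php
<?php
// XMLReader streams cursor-style: only the current node is materialised,
// so a multi-hundred-MB document stays cheap to traverse.
$reader = new XMLReader();

// In the real script this would be $reader->open($_CFG_XML_URL);
$reader->XML('<Catalog><EBook EAN="9792219192201"><Title>Success-a-Phobia</Title></EBook></Catalog>');

$eans = [];
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'EBook') {
        // Hand just this one subtree to SimpleXML for convenient access.
        $ebook = new SimpleXMLElement($reader->readOuterXml());
        $eans[] = (string) $ebook['EAN'];
    }
}
$reader->close();

print_r($eans);
```

The readOuterXml() trick keeps the convenience of SimpleXML for each record while only ever holding one `EBook` subtree in memory at a time.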

2 Answers


Try validating the XML using xmllint; it is available as a command-line tool on Linux.

If the file is valid, you should double-check your memory_limit ini setting. Remember that DOM processing (which is what SimpleXML does) requires the whole file to be held in memory. In your case memory_limit should be set to at least 500MB.
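If raising the limit is the route taken, it can be done per script rather than globally; the 2G figure below is illustrative only:

```php
<?php
// Raise the limit for this script only. The DOM approach needs roughly the
// document size for the string plus several times that for the parsed tree,
// so the headroom has to be generous.
ini_set('memory_limit', '2G');
echo ini_get('memory_limit'), "\n";
```

From the command line, `php -d memory_limit=2G catalog.php` has the same effect without touching the code.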

If you cannot increase your memory limit, you will have to consider a less memory-consuming way to parse the XML. SAX may be appropriate in this situation, although it requires more programming attention.

In PHP, SAX is available through the xml extension, which is enabled by default. Here you can find the documentation.
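A minimal SAX sketch with the xml extension, collecting titles from an inline stand-in document; in the real script the parse call would sit in a loop feeding fread() chunks, so memory use stays flat regardless of file size (the element names come from the sample in the question, the rest is illustrative):

```php
<?php
// SAX-style parsing: event callbacks fire per element, no tree is built.
$titles  = [];
$current = '';

$parser = xml_parser_create();
// Keep element names as-is instead of upper-casing them.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    function ($p, $name, $attrs) use (&$current) { $current = $name; },
    function ($p, $name) use (&$current) { $current = ''; }
);
xml_set_character_data_handler(
    $parser,
    function ($p, $data) use (&$current, &$titles) {
        if ($current === 'Title') {
            $titles[] = $data;
        }
    }
);

// Real use: while (!feof($fh)) { xml_parse($parser, fread($fh, 8192), feof($fh)); }
xml_parse($parser, '<Catalog><EBook><Title>Success-a-Phobia</Title></EBook></Catalog>', true);
xml_parser_free($parser);

print_r($titles);
```

The trade-off is exactly the "programming attention" mentioned above: you track your own position in the document instead of navigating a ready-made tree.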

hek2mgl
  • I've not tried it. However, it appears to have worked fine when `$xml = new SimpleXMLElement(file_get_contents($_CFG_XML_URL));` was switched to `$xml = simplexml_load_file($_CFG_XML_URL);` – Engineer81 Dec 20 '12 at 21:52
  • Which libxml version does your PHP use? Which OS are you on? – hek2mgl Dec 20 '12 at 22:02
  • 2.7.8 according to `phpinfo()`. As in the original question: Amazon Linux on EC2 – Engineer81 Dec 20 '12 at 22:25
  • OK. Without the XML file itself I cannot say more. It could be a libxml or a PHP problem. However, you got it working. – hek2mgl Dec 20 '12 at 22:33

978 may not ring any bells, but 1000 might! Four spaces at the start of the line plus the 18 characters of `<ShortDescription>` account for the 22 characters needed to reach it. A round number like 1000 makes some sort of buffer-length limitation more likely.

arayq2
  • Isn't it unusual for a value of 1000 to be the problem? I'd expect it to be more likely at 1024 or similar. Besides, after removing that element from the end we found that the preceding one caused the same error, and it's a completely different size. It's all very odd, and it has been resolved by switching to simplexml_load_file(). Oh, and there aren't any spaces in front, because it's all on one line with one element immediately following the other. – Engineer81 Dec 21 '12 at 23:31