I'm dealing with a third-party PHP library that I can't edit, and it has been working fine for almost a year. It uses simplexml_load_string on the response from a remote server. Lately it has been choking on large responses. This is a data feed for real estate listings, and the format looks something like this:

<?xml version="1.0"?>
<RETS ReplyCode="0" ReplyText="Operation Successful Reference ID: 9bac803e-b507-49b7-ac7c-d8e8e3f3aa89">
<COUNT Records="9506" />
<DELIMITER value="09" />
<COLUMNS>   sysid   1   2   3   4   5   6   </COLUMNS>
<DATA>  252370080   Residential 0.160   No  ADDR0   06051</DATA>
<DATA>  252370081   Residential 0.440   Yes ADDR0   06043</DATA>
<DATA>  252370082   Residential 1.010   No  ADDR0   06023</DATA>
<DATA>More tab delimited text</DATA>
<!-- snip 9000+ lines -->
</RETS>

I downloaded a sample response (about 22MB), and here is where my debugging has ended up. Both servers are running PHP 5.3.8, but note the different results. I'm as certain as I can be that both files are the same; I suppose the differences in filesize, strlen, and the last 50 characters can be explained by Windows newlines having an extra carriage return character. Test script:

error_reporting(-1);
ini_set('display_errors', 1);
$file = 'error-example.xml';
$xml = file_get_contents($file);

echo 'filesize:              ';
var_dump(filesize($file));

echo 'strlen:                ';
var_dump(strlen($xml));

echo 'simplexml object?      ';
var_dump(is_object(simplexml_load_string($xml)));

echo 'Last 50 characters:    ';
var_dump(substr($xml, -50));

Output locally on Windows:

filesize:              int(21893604)
strlen:                int(21893604)
simplexml object?      bool(true)
Last 50 characters:    string(50) "RD DR    CT  Watertown   203-555-5555            </DATA>
</RETS>"

Output on remote UNIX server:

filesize:              int(21884093)
strlen:                int(21884093)
simplexml object?      
Warning: simplexml_load_string(): Entity: line 9511: parser error : internal error in /path/to/test.php on line 19

Warning: simplexml_load_string(): AULTED CEILING IN FOYER, BRICK FP IN FR, NEW FLOORING IN LR DR FR FOYER KITCHEN  in /path/to/test.php on line 19

Warning: simplexml_load_string():                                                                                ^ in /path/to/test.php on line 19

Warning: simplexml_load_string(): Entity: line 9511: parser error : Extra content at the end of the document in /path/to/test.php on line 19

Warning: simplexml_load_string(): AULTED CEILING IN FOYER, BRICK FP IN FR, NEW FLOORING IN LR DR FR FOYER KITCHEN  in /path/to/test.php on line 19

Warning: simplexml_load_string():                                                                                ^ in /path/to/test.php on line 19
bool(false)
Last 50 characters:    string(50) "ORD DR   CT  Watertown   203-555-5555            </DATA>
</RETS>"
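
The size gap between the two runs is 21893604 - 21884093 = 9511 bytes, which would be consistent with one extra carriage return per line, given that the parser error is reported at line 9511. A minimal sketch for confirming that two copies differ only in line endings follows. The helper names and the file names `windows-copy.xml` and `unix-copy.xml` are hypothetical and not part of the original test.

```php
<?php
// Hypothetical helper: true when two strings differ only in line endings.
function sameApartFromNewlines($a, $b)
{
    // Normalise Windows CRLF to LF before comparing checksums.
    $normA = str_replace("\r\n", "\n", $a);
    $normB = str_replace("\r\n", "\n", $b);
    return md5($normA) === md5($normB);
}

// Hypothetical helper: number of CRLF pairs, i.e. extra bytes versus LF-only.
function extraCarriageReturns($s)
{
    return substr_count($s, "\r\n");
}

// Assumed file names; replace with the two real copies.
$windows = 'windows-copy.xml';
$unix    = 'unix-copy.xml';
if (is_readable($windows) && is_readable($unix)) {
    $w = file_get_contents($windows);
    $u = file_get_contents($unix);
    // Same content once newlines are normalised?
    var_dump(sameApartFromNewlines($w, $u));
    // Does the CRLF count account for the whole size difference?
    var_dump(extraCarriageReturns($w) === strlen($w) - strlen($u));
}
```

If both checks print bool(true), the content itself is identical and the parsing difference has to come from the environment rather than from the file.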

Some replies to comments and additional info:

  • The XML itself appears to be valid as far as I can tell (and it does work on my system).

  • magic_quotes_runtime is definitely off.

  • The working server has libxml Version 2.7.7 while the other has 2.7.6. Could that really make the difference? I could not find a libxml change log but it seems unlikely.

  • This seems to only happen when the response/file is over a certain size, and the error always occurs at the next-to-last line.

  • I am not running into memory issues; the test script runs instantly.

There are differences in the PHP configurations, which I could post if I knew which ones were relevant. Any idea what the problem could be, or is there anything else I should check?

hakre
Wesley Murch
  • Just guessing: If `magic_quotes_runtime` is set, you could do `$xml=stripslashes($xml);` after doing `file_get_contents(...)` – web-nomad Feb 19 '13 at 05:51
  • Might be `error_reporting` and `display_errors`. [Official Docs](http://www.php.net/manual/en/errorfunc.configuration.php#ini.error-reporting). Also, check `memory_limit` since it sounds like your script would likely exceed the default limit. – neelsg Feb 19 '13 at 06:11
  • Also, even though you get different error messages, it does look like you get the same general issue on both, so I'm leaning towards an invalid xml file. – neelsg Feb 19 '13 at 06:15
  • @neelsg I don't get any issue at all on one of them, so I don't know what you could possibly mean? – Wesley Murch Feb 21 '13 at 18:49
  • Is one of these systems running 32 bit libs and the other 64? – Francis Avila Feb 21 '13 at 18:57
  • Tested that it works fine on PHP 5.4.4 and libxml 2.7.8 (on OS X) and on PHP 5.2.2 and libxml 2.7.6 (on Dreamhost's linux box). Did you try doing utf8_encode() on your $xml? Found this here: http://stackoverflow.com/a/2901794/1320627 – TddOrBust Feb 21 '13 at 19:03
  • @FrancisAvila "32 bit libs" - sorry to be daft but do you mean libxml itself? Here's a side-by-side of the `phpinfo` output if that helps any: http://wesleymurch.com/xml-error.html – Wesley Murch Feb 21 '13 at 19:07
  • @SteveoDevo Thanks for the attention to my issue. I'll give that a try but to be honest this just seems buggy to me, especially since it seems to be related to the size of the input. I've been working around the issue in production by breaking the response into smaller pieces but it's not a permanent solution. I don't understand the errors - it's pointing to a space character... – Wesley Murch Feb 21 '13 at 19:11
  • @SteveoDevo It works fine for you on Linux with PHP 5.2.2 and libxml 2.7.6? The exact same test script? If that's true I may just give up and delete the question, it might be something impossible for others to troubleshoot. – Wesley Murch Feb 21 '13 at 19:17
  • Did you open it up in a hex editor to make sure it's a real space and not some invalid bytes? – Francis Avila Feb 21 '13 at 19:19
  • Yep, it works fine. Same test script. Same input file (copied directly to my web host from the zip file, so I never resave it). I had the PHP version wrong: 5.2.17. – TddOrBust Feb 21 '13 at 19:26

3 Answers

34

The libxml2 changelog contains "608773 add a missing check in xmlGROW (Daniel Veillard)", which seems to be related to input buffering. Note I don't know anything about libxml2 internals, but it seems conceivable that you have tickled a 2.7.6 bug fixed in 2.7.7.

Check if the behavior is any different when you use simplexml_load_file() directly, and try setting libxml parser-related options, e.g.

simplexml_load_string($xml, 'SimpleXMLElement', LIBXML_COMPACT | LIBXML_PARSEHUGE)

Specifically, you might want to try the LIBXML_PARSEHUGE flag.

From http://php.net/manual/en/libxml.constants.php:
"XML_PARSE_HUGE flag relaxes any hardcoded limit from the parser. This affects limits like maximum depth of a document or the entity recursion, as well as limits of the size of text nodes."
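
As a rough sketch of how the flag can be tried while collecting the parser's messages instead of printing warnings: the helper name `loadLargeXml` is made up for illustration, while `libxml_use_internal_errors()`, `libxml_get_errors()`, and `libxml_clear_errors()` are standard PHP functions. It assumes a build in which `LIBXML_PARSEHUGE` is defined.

```php
<?php
// Hypothetical helper: parse with LIBXML_PARSEHUGE and collect libxml errors
// in $errors instead of letting them surface as PHP warnings.
function loadLargeXml($xml, &$errors = array())
{
    // Switch to internal error handling and remember the previous setting.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $doc = simplexml_load_string($xml, 'SimpleXMLElement', LIBXML_PARSEHUGE);

    // Copy any parser messages into a plain array of strings.
    $errors = array();
    foreach (libxml_get_errors() as $error) {
        $errors[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    // Restore the caller's error-handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc; // SimpleXMLElement on success, false on failure
}

// Tiny stand-in document; the real feed would come from file_get_contents().
$doc = loadLargeXml('<RETS ReplyCode="0"><COUNT Records="2" /></RETS>', $errors);
var_dump($doc !== false);
var_dump($errors);
```

Running the same helper with and without the flag on the failing server is one way to see whether the hardcoded parser limits are what trips the large response.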

Yann Chabot
Francis Avila
  • I'll take a look at this answer and your comments this evening (am swamped with work right now), thanks a lot for replying and sorry to be in a hurry/inattentive. – Wesley Murch Feb 21 '13 at 19:22
  • All signs seem to be pointing to the idea that we need to upgrade libxml. As far as I've read, I think we need to recompile PHP. Sorry to be inattentive to this post, I've had other things on the front burner. – Wesley Murch Feb 26 '13 at 14:13
  • First I'm going to try downgrading my local libxml and see if I can reproduce the error. – Wesley Murch Feb 26 '13 at 14:25
  • Oh, dude, `LIBXML_PARSEHUGE` was it! I don't know how, but I missed it earlier. Thanks, sorry again for being a space case. – Wesley Murch Feb 27 '13 at 19:30
  • It's worth noting that this applies to other libxml-based features such as XMLReader. – Álvaro González Nov 25 '15 at 08:22
  • LIBXML_PARSEHUGE was definitely the solution! Thanks! – Facundo Fasciolo Dec 04 '15 at 15:40
  • Hello @FacundoFasciolo, I am facing same issue, your suggestion doesn't seem to work. I have posted my scenario http://stackoverflow.com/questions/42467791/huge-input-lookup-error-on-simplexml-load-string-function. Do you have any Idea? – VijayRana Feb 26 '17 at 11:20
  • !!!Great LIBXML_PARSEHUGE, just disable the NOTICES if you don't want to be bored. – David Lopes Jun 14 '18 at 14:39
2

Your XML is invalid and should cause an issue in both cases.

You need to have ONLY ONE ROOT.

i.e., everything should be inside your root tags:

<?xml version="1.0"?>
<RETS>
    ...
</RETS>

You have multiple roots in your XML, which will cause an issue :-)

Try wrapping it all in a root node and see if it works.

<?xml version="1.0"?>
<rootNode>
    <RETS>
    ...
    </RETS>
    <count bla="99" />
</rootNode>

I'm not sure if it would be the difference in libxml, or a different level of error reporting allowing it to work on one and not the other, but that looks like the issue to me.

Andrew
  • Unfortunately this wasn't the cure! The only thing I can think of is the libxml version, that's what I'm going to check next (I've been trying to avoid it). – Wesley Murch Feb 26 '13 at 14:26
0

My XMLSpy confirmed that your XML file (which I downloaded from the link you provided) has no issues and is well-formed.

One potential issue, however, is that the "encoding" attribute is missing from the XML declaration. Depending on your version of libxml2, I guess the following scenario might be possible: the server checks for the encoding attribute and, in its absence, falls back to some default value (a configuration setting). Maybe older library versions don't check the BOM.

Please also see this question, where they had a similar encoding problem with libxml: https://stackoverflow.com/questions/4724241/utf-8-problems-with-php-dom-on-debian-server

The essence of it is that an upgrade of your libxml library might indeed solve the problem. Alternatively, it might be worth checking the default encoding setting in the configuration.

According to my XMLSpy, the file is utf-8 encoded - as a test, maybe it's worth checking if specifying

<?xml version="1.0" encoding="UTF-8"?>

as the file preamble stops your Unix server from choking.
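
Since the response is already held in a string, one way to run that test without touching the file is sketched below. The function name `withUtf8Declaration` is hypothetical, and the sketch assumes the data really is UTF-8; it only rewrites or adds the declaration and does not convert any bytes.

```php
<?php
// Hypothetical helper: make sure the XML declaration names UTF-8.
// Assumes the payload is already UTF-8; no byte conversion happens here.
function withUtf8Declaration($xml)
{
    // Drop a leading UTF-8 byte order mark so the declaration check sees "<?xml".
    if (substr($xml, 0, 3) === "\xEF\xBB\xBF") {
        $xml = substr($xml, 3);
    }

    // A declaration that already names an encoding is left untouched.
    if (preg_match('/^<\?xml[^>]*\bencoding\s*=/', $xml)) {
        return $xml;
    }

    // A declaration with only a version: add the attribute right after it.
    if (preg_match('/^<\?xml\s+version\s*=\s*(["\'])[^"\']*\1/', $xml, $m)) {
        return substr_replace($xml, $m[0] . ' encoding="UTF-8"', 0, strlen($m[0]));
    }

    // No declaration at all: prepend a complete one.
    return '<?xml version="1.0" encoding="UTF-8"?>' . "\n" . $xml;
}

echo withUtf8Declaration('<?xml version="1.0"?><RETS ReplyCode="0" />'), "\n";
```

The result can then be passed to simplexml_load_string() in place of the original string to see whether the declaration makes any difference.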

Community
marty
  • Unfortunately this wasn't the cure! The only thing I can think of is the libxml version, that's what I'm going to check next (I've been trying to avoid it). Since the errors only seem to occur when the input is beyond a certain size, I'm guessing/hoping it's a bug that can be solved by an upgrade. – Wesley Murch Feb 26 '13 at 14:28