As the title says.

I'm processing large downloaded XML files on the fly. Some of those files contain characters that are invalid in XML, such as "US" (unit separator) or "VT" (vertical tab). I have no idea why those characters are there to begin with, and I can't fix them at the source.

$z = new XMLReader;
$z->open('compress.zlib://'.$file, "UTF-8");
// Skip ahead to the first <p> element
while ($z->read() && $z->name !== 'p');
while ($z->name === 'p') {
    try {
        // Throws when the element's XML cannot be parsed
        $node = new SimpleXMLElement($z->readOuterXML());
    } catch (Exception $e) {
        echo $e->getMessage();
    }
    // And so on
}

I get an error saying "String could not be parsed as XML".

What can I do here?

nick
  • Strip them out of the file before you parse it. – Petah Feb 15 '12 at 02:01
  • the xml files are gzipped. i need to extract, go through 12gb of xml data, and then parse - this needs to be done daily and those additional steps take too long. its not an option atm – nick Feb 15 '12 at 04:02

1 Answer

Ended up finding a solution after all.

I decided to use fopen to read the compressed file line by line, build each record's XML as a string, and process it on the fly. Here's what I ended up with:

$handle = fopen('compress.zlib://'.$file, 'r');
$xml_source = '';
$record = false;
if ($handle) {
    while (($buffer = fgets($handle, 4096)) !== false) {
        // Start of a record: begin a fresh document and start collecting lines
        if (strpos($buffer, '<open_tag>') !== false) {
            $xml_source = '<?xml version="1.0" encoding="UTF-8"?>';
            $record = true;
        }
        // End of a record: append the last line, then clean and parse it
        if (strpos($buffer, '</close_tag>') !== false) {
            $xml_source .= $buffer;
            $record = false;
            $xml = simplexml_load_string(stripInvalidXml($xml_source));

            // ... do stuff here with the xml element

        }
        if ($record) {
            $xml_source .= $buffer;
        }
    }
    fclose($handle);
}

The function stripInvalidXml() is the one quickshiftin provided. Works like a charm.
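
For readers who can't reach the original code: below is a minimal sketch of what a function like stripInvalidXml() could look like. It is not the code quickshiftin provided, only an assumption about its behavior: it drops every character outside the XML 1.0 "Char" range, which covers the US (0x1F) and VT (0x0B) characters from the question. The fallback for input that isn't valid UTF-8 is also my own addition.

```php
<?php
// Hypothetical stand-in for the stripInvalidXml() helper used above.
// Removes every character that XML 1.0 does not allow in a document.
function stripInvalidXml(string $value): string
{
    // Allowed by XML 1.0: tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD,
    // U+10000-U+10FFFF. Everything else is deleted.
    $clean = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $value
    );

    if ($clean === null) {
        // preg_replace() returns null with the /u flag when the input is not
        // valid UTF-8. Fall back to removing only the ASCII control bytes,
        // which is safe byte-wise because they never occur inside a
        // multi-byte UTF-8 sequence.
        $clean = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $value);
    }

    return $clean ?? '';
}
```

For example, stripInvalidXml("<p>a\x1Fb\x0Bc</p>") returns "<p>abc</p>", which simplexml_load_string() can then parse.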

nick
  • whoops - <open_tag> and </close_tag> have to be the same tag. :) – nick Feb 15 '12 at 14:15
  • The link to the code is dead, so the answer is currently incomplete – lijat Apr 26 '17 at 10:35
  • @nick do you happen to have the code behind `stripInvalidXml`, since the original link is now dead? I found one example in another SO post [here](https://stackoverflow.com/a/3466049/680920)? – quickshiftin May 20 '21 at 19:28