I use the PHP zip:// stream wrapper to parse large XML files line by line. For example:

$stream_uri = 'zip://' . __DIR__ . '/archive.zip#foo.xml';
$reader     = new XMLReader();
$reader->open( $stream_uri, null );
$reader->read();

while ( true ) {
    echo( $reader->readInnerXml() . PHP_EOL );
    if ( ! $reader->next() ) {
        break;
    }
}

Quite often an XML file includes stray control characters that XMLReader rejects. So I'd like to implement a custom stream wrapper that takes the output of the zip:// stream and runs a preg_replace over it to remove those characters.

My dream is to be able to do this:

stream_wrapper_register( 'xmlchars', 'XML_Chars' );
$stream_uri = 'xmlchars://zip://' . __DIR__ . '/archive.zip#foo.xml';

and have XMLReader happily read the tidied-up nodes. I've figured out a way to reconstruct the zip stream URI based on the path passed to my wrapper:

class XML_Chars {

    protected $stream_uri = '';
    protected $handle;

    function stream_open( $path, $mode, $options, &$opened_path ) {
        $parsed_url       = parse_url( $path );
        $this->stream_uri = 'zip:' . $parsed_url['path'] . '#' . $parsed_url['fragment'];

        return true;
    }

}

But I'm puzzled about the best way to open the zip:// stream so I can modify its output and pass the result through to the XMLReader. Can anyone give me any pointers about how to implement that?

And Finally

1 Answer

In case it's useful to anybody else, I've found a different way to solve the problem: a stream filter. You define it like this:

class UTF_Character_Filter extends php_user_filter {
    public function filter( $in, $out, &$consumed, $closing ) {
        while ( $bucket = stream_bucket_make_writeable( $in ) ) {
            $consumed += $bucket->datalen;
            // Remove the bytes 0x00-0x08, 0x0B, 0x0C and 0x0E-0x1F,
            // i.e. every C0 control character except tab, newline and carriage return
            $bucket->data = preg_replace( '/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $bucket->data );
            stream_bucket_append( $out, $bucket );
        }

        return PSFS_PASS_ON;
    }
}

stream_filter_register( 'utf_character_filter', 'UTF_Character_Filter' );

And use it like this:

php://filter/read=utf_character_filter/resource=zip://archive.zip#import.xml
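
To show how that URI plugs into XMLReader, here is a minimal sketch. It repeats a compact copy of the filter so it runs on its own, and it uses a temporary file with made-up sample data in place of the zip entry (both are assumptions for illustration, not part of my real import):

```php
<?php
// Compact copy of the filter so this sketch is self-contained.
class UTF_Character_Filter extends php_user_filter {
    public function filter( $in, $out, &$consumed, $closing ): int {
        while ( $bucket = stream_bucket_make_writeable( $in ) ) {
            $consumed += $bucket->datalen;
            $bucket->data = preg_replace( '/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $bucket->data );
            stream_bucket_append( $out, $bucket );
        }

        return PSFS_PASS_ON;
    }
}

stream_filter_register( 'utf_character_filter', 'UTF_Character_Filter' );

// Hypothetical sample data: a temp file stands in for zip://archive.zip#import.xml.
// The \x08 byte would normally make the parser reject the document.
$path = tempnam( sys_get_temp_dir(), 'xml' );
file_put_contents( $path, "<items><item>foo\x08bar</item></items>" );

$reader = new XMLReader();
$reader->open( 'php://filter/read=utf_character_filter/resource=' . $path );

// Collect the text of every <item> element.
$values = array();
while ( $reader->read() ) {
    if ( XMLReader::ELEMENT === $reader->nodeType && 'item' === $reader->name ) {
        $values[] = $reader->readString();
    }
}

$reader->close();
unlink( $path );
```

Because every character being stripped is a single byte, it doesn't matter where the bucket boundaries fall; a pattern that has to match multi-byte sequences would need buffering across buckets.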

I'd still be interested to know whether anyone has figured out how to make a stream wrapper that accepts the output of another stream wrapper, though, as it could be a handy tool.
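
For what it's worth, a rough sketch of the shape such a wrapper might take is below. It assumes everything after xmlchars:// is the URI of the inner stream and simply opens that with fopen(). The temp file and sample data are hypothetical, and I've only exercised it through plain PHP stream functions; whether XMLReader accepts the nested xmlchars://zip://... URI is not something this sketch verifies.

```php
<?php
class XML_Chars {

    public $context;    // Populated by PHP when a stream context is supplied.
    protected $handle;  // Handle of the inner (wrapped) stream.

    public function stream_open( $path, $mode, $options, &$opened_path ) {
        // Assumption: the inner URI is whatever follows "xmlchars://".
        $inner        = substr( $path, strlen( 'xmlchars://' ) );
        $this->handle = fopen( $inner, $mode );

        return false !== $this->handle;
    }

    public function stream_read( $count ) {
        // Keep reading until something survives the clean-up or the inner
        // stream ends, so a chunk of pure control characters isn't
        // mistaken for the end of the data.
        do {
            $chunk = fread( $this->handle, $count );
            if ( false === $chunk ) {
                return false;
            }
            $clean = preg_replace( '/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $chunk );
        } while ( '' === $clean && ! feof( $this->handle ) );

        return $clean;
    }

    public function stream_eof() {
        return feof( $this->handle );
    }

    public function stream_stat() {
        // No meaningful stat data; an empty array avoids a "not implemented" warning.
        return array();
    }

    public function stream_close() {
        fclose( $this->handle );
    }
}

stream_wrapper_register( 'xmlchars', 'XML_Chars' );

// Hypothetical usage with a temp file as the inner stream.
$path = tempnam( sys_get_temp_dir(), 'xml' );
file_put_contents( $path, "<a>x\x01y</a>\n" );
$clean = file_get_contents( 'xmlchars://' . $path );
unlink( $path );
```

This is read-only and forward-only: there is no stream_write or stream_seek, which should be enough for a streaming parser but not for general use.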

And Finally