
I need some help with the following:

I want to stream-parse a large XML file (4 GB) with PHP. I can't use SimpleXML or DOM because they load the entire file into memory, so I need something that can stream the file.

How can I do this in PHP?

What I am trying to do is navigate through a series of <doc> elements and write some of their children to a new XML file.

The XML file I am trying to parse looks like this:

<feed>
    <doc>
        <title>Title of first doc is here</title>
        <url>URL is here</url>
        <abstract>Abstract is here...</abstract>
        <links>
            <sublink>Link is here</sublink>
            <sublink>Link is here</sublink>
            <sublink>Link is here</sublink>
            <sublink>Link is here</sublink>
            <sublink>Link is here</sublink>
        </links>
    </doc>
    <doc>
        <title>Title of second doc is here</title>
        <url>URL is here</url>
        <abstract>Abstract is here...</abstract>
        <links>
            <sublink>Link is here</sublink>
            <sublink>Link is here</sublink>
            <sublink>Link is here</sublink>
            <sublink>Link is here</sublink>
            <sublink>Link is here</sublink>
        </links>
    </doc>
</feed>

I'm trying to get / copy all the children of each <doc> element into a new XML file except the <links> element and its children.

So I want the new XML file to look like:

<doc>
    <title>Title of first doc is here</title>
    <url>URL is here</url>
    <abstract>Abstract is here...</abstract>
</doc>
<doc>
    <title>Title of second doc is here</title>
    <url>URL is here</url>
    <abstract>Abstract is here...</abstract>
</doc>

I would greatly appreciate any and all help in streaming / stream parsing / stream reading the original XML file and then writing some of its contents to a new XML file in PHP.

Django Johnson
  • Check out the XMLReader class: http://www.php.net/manual/en/intro.xmlreader.php It's a streaming parser. I'm reading your question more deeply right now to see if I can help with more specific answers. – DeeDee Aug 29 '13 at 18:39
  • @DeeDee I had heard of XMLReader, but didn't know how to use it. Thank you for the help! – Django Johnson Aug 29 '13 at 19:30
  • Sure! It's not too heavily used, as evidenced by the dearth of comments in the official documentation. I myself haven't used it in a very long time. Can you let me know how my code works? If it doesn't work immediately we can collaborate and figure out what's up. – DeeDee Aug 29 '13 at 19:32
  • @DeeDee Yes, sure thing. I will try it out later tonight and let you know if it works or any errors I get. Thank you for helping me and for being willing to collaborate until it is solved :-) – Django Johnson Aug 29 '13 at 19:34
  • And.. why was this closed? I'm pretty sure the question is on topic for php, xml, xml-parsing, large files.. wtf – That Realty Programmer Guy Apr 15 '14 at 15:15

2 Answers


Here's a college try. This assumes a file is being used, and that you want to write to a file:

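<?php

$interestingNodes = array('title', 'url', 'abstract');
$xmlObject = new XMLReader();
$xmlObject->open('bigolfile.xml');

$xmlOutput = new XMLWriter();
$xmlOutput->openURI('destfile.xml');
$xmlOutput->setIndent(true);
$xmlOutput->setIndentString("   ");
$xmlOutput->startDocument('1.0', 'UTF-8');
// Note: the requested output has no single root element; wrap the <doc>
// elements in one if the result has to be well-formed XML.

$insideDoc = false;
while ($xmlObject->read()) {
    if ($xmlObject->nodeType == XMLReader::ELEMENT) {
        if ($xmlObject->name == 'doc' && !$xmlObject->isEmptyElement) {
            $xmlOutput->startElement('doc');
            $insideDoc = true;
        } elseif ($insideDoc && in_array($xmlObject->name, $interestingNodes)) {
            // readString() returns the text content of the current element
            $xmlOutput->writeElement($xmlObject->name, $xmlObject->readString());
        }
    } elseif ($xmlObject->nodeType == XMLReader::END_ELEMENT && $xmlObject->name == 'doc') {
        $xmlOutput->endElement(); // close the doc node
        $xmlOutput->flush();      // write out the buffer after each doc
        $insideDoc = false;
    }
}

$xmlObject->close();
$xmlOutput->endDocument();
$xmlOutput->flush();

?>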
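
A variation on the XMLReader approach, offered as an untested-in-production sketch: if each individual <doc> is small enough to hold in memory (an assumption, the file as a whole still is not), XMLReader::expand() can hand one <doc> at a time to DOM, where dropping <links> is a single removeChild() call. The function name copyDocsWithoutLinks and its two path parameters are made up for illustration.

```php
<?php
// Hypothetical helper: copies every <doc> from $inPath to $outPath,
// dropping each <links> element. Returns the number of <doc> elements written.
function copyDocsWithoutLinks($inPath, $outPath)
{
    $reader = new XMLReader();
    if (!$reader->open($inPath)) {
        throw new RuntimeException("could not open $inPath");
    }
    $out = fopen($outPath, 'w');
    if ($out === false) {
        throw new RuntimeException("could not open $outPath");
    }

    $dom = new DOMDocument();
    $count = 0;

    // Advance to the first <doc> element.
    while ($reader->read()
        && !($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'doc')) {
        // keep reading
    }

    while ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'doc') {
        // expand() builds a DOM subtree for the current <doc> only.
        $doc = $dom->importNode($reader->expand(), true);

        // Snapshot the node list first, then remove every <links> element
        // (and with it all of its <sublink> children).
        foreach (iterator_to_array($doc->getElementsByTagName('links')) as $links) {
            $links->parentNode->removeChild($links);
        }

        fwrite($out, $dom->saveXML($doc) . "\n");
        $count++;

        // next('doc') skips the current subtree and moves to the next <doc>.
        if (!$reader->next('doc')) {
            break;
        }
    }

    $reader->close();
    fclose($out);
    return $count;
}
```

Calling copyDocsWithoutLinks('bigolfile.xml', 'destfile.xml') would then write one <doc> per iteration. Whitespace inside each <doc> is copied as-is, so the output can contain a blank line where <links> used to be.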
DeeDee

For this scenario you can't afford a DOM parser: as you stated, the file will not fit in memory, and even if it did, it would be slow, because the parser first loads the entire file and only then lets you iterate through it. For this case you should try a SAX parser (event/stream oriented): add handlers for the tags you're interested in (doc, title, url, abstract) and, for every event, append the node found to the new XML file.

Here you have more information:

What is the fastest XML parser in PHP?

Here is a (not tested) sample of what the code would be:

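<?php
    $file = "bigfile.xml";
    $fh = fopen("out.xml", 'w') or die("can't open file");
    $currentNodeTag = "";
    $tags = array("doc", "title", "url", "abstract");

    // Write the opening tag for every element we want to keep.
    function startElement($parser, $name, $attrs) {
        global $tags, $fh, $currentNodeTag;

        $name = strtolower($name); // the parser upper-cases names by default
        if (in_array($name, $tags)) {
            $currentNodeTag = $name;
            // <doc> gets its own line; leaf tags stay inline with their text
            fwrite($fh, $name == "doc" ? "<doc>\n" : sprintf("    <%s>", $name));
        }
    }

    // Write the matching closing tag.
    function endElement($parser, $name) {
        global $tags, $fh, $currentNodeTag;

        $name = strtolower($name);
        if (in_array($name, $tags)) {
            fwrite($fh, sprintf("</%s>\n", $name));
            $currentNodeTag = "";
        }
    }

    // Copy text only while inside <title>, <url> or <abstract>.
    function characterData($parser, $data) {
        global $fh, $currentNodeTag;

        if ($currentNodeTag != "" && $currentNodeTag != "doc") {
            // the parser decodes entities, so escape them again on output
            fwrite($fh, htmlspecialchars($data, ENT_NOQUOTES | ENT_XML1));
        }
    }

    $xmlParser = xml_parser_create();
    xml_set_element_handler($xmlParser, "startElement", "endElement");
    xml_set_character_data_handler($xmlParser, "characterData");

    if (!($fp = fopen($file, "r"))) {
        die("could not open XML input");
    }

    // Feed the parser in 4 KB chunks so the whole file is never in memory.
    while ($data = fread($fp, 4096)) {
        if (!xml_parse($xmlParser, $data, feof($fp))) {
            die(sprintf("XML error: %s at line %d",
                        xml_error_string(xml_get_error_code($xmlParser)),
                        xml_get_current_line_number($xmlParser)));
        }
    }

    xml_parser_free($xmlParser);
    fclose($fp);
    fclose($fh);
?>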
higuaro
  • I'm getting an error with the code that I can't seem to fix. It also doesn't make sense. The error I am getting is: `PHP Parse error: syntax error, unexpected ';' in /Users/irfanm/Desktop/mamp/xml2.php on line 12`. – Django Johnson Aug 29 '13 at 23:24