
I'm trying to handle a large (20 MB+) XML file but am having problems parsing it because of its size. I want to break it down into smaller files of, say, 100 property nodes each.

I currently split the file on other criteria using the code below, but I'm unsure how to adapt it to write 100 records per file:

$destination = new DOMDocument;
$destination->preserveWhiteSpace = true;
$destination->loadXML('<?xml version="1.0" encoding="utf-8"?><root></root>');

$source = new DOMDocument;
$source->load('bes1c8ca168f910.xml');

$xp = new DOMXPath($source);
$destRoot = $destination->getElementsByTagName("root")->item(0);

foreach ($xp->query('/root/property[rent]') as $item) {
    $newItem = $destination->importNode($item, true);
    $destRoot->appendChild($newItem);
    $item->parentNode->removeChild($item);
}

$source->save("sales.xml");
$destination->formatOutput = true;
$destination->save("rentals.xml");

Any advice is appreciated, thanks.

d1ch0t0my
  • If you have a problem parsing this due to its size, then I am not sure trying to split it _using_ a parser again is the solution. Anyway, if you want to try, then it's pretty simple - keep a counter in your loop over the nodes, and when that reaches 100 it is time to close and save the current document, and start a new one ... – CBroe Nov 08 '17 at 20:59
  • Consider using XMLReader to parse the whole file in one piece rather than splitting it. Have a look at something like https://stackoverflow.com/questions/1835177/how-to-use-xmlreader-in-php, which shows how you can read a large file but still process it in chunks. – Nigel Ren Nov 09 '17 at 07:42
  • OP - I adjusted my answer where PHP now passes a parameter to an external XSLT script to do the split similar to PHP binding parameters to SQL statements. And XSLT and SQL share similarities as special-purpose, declarative languages. I know XSLT is new and formidable but give it a chance. – Parfait Nov 09 '17 at 15:27
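
The counter approach from the first comment could be sketched roughly as below. This is only a sketch under assumptions: the function name `splitProperties()` and the `part_N.xml` file naming are hypothetical, the root element is assumed to be `<root>` as in the question, and the whole source is still loaded with DOM, so it does not remove the memory concern raised in that comment.

```php
<?php
// Sketch: split the <property> children of <root> into files of $perFile nodes each.
// splitProperties() and the "part_N.xml" naming are hypothetical, for illustration only.
function splitProperties(string $sourceFile, string $outDir, int $perFile = 100): array
{
    $source = new DOMDocument;
    $source->load($sourceFile);
    $xp = new DOMXPath($source);

    $written = [];   // names of the files saved so far
    $chunk = null;   // current output document, or null if none is open
    $count = 0;      // number of <property> nodes copied so far

    foreach ($xp->query('/root/property') as $item) {
        if ($chunk === null) {
            // Start a new output document with an empty <root>
            $chunk = new DOMDocument;
            $chunk->formatOutput = true;
            $chunk->loadXML('<root></root>');
        }
        $chunk->documentElement->appendChild($chunk->importNode($item, true));
        $count++;

        if ($count % $perFile === 0) {
            // Counter hit a multiple of $perFile: save and close this chunk
            $name = $outDir . '/part_' . intdiv($count, $perFile) . '.xml';
            $chunk->save($name);
            $written[] = $name;
            $chunk = null;
        }
    }

    if ($chunk !== null) {
        // Save the final, partially filled chunk
        $name = $outDir . '/part_' . (intdiv($count, $perFile) + 1) . '.xml';
        $chunk->save($name);
        $written[] = $name;
    }
    return $written;
}
```

Calling `splitProperties('bes1c8ca168f910.xml', '.', 100)` would then return the list of files it wrote, with the last one holding whatever remainder is left.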

2 Answers


Consider dynamic XSLT, where PHP passes successive multiples of 100 and XSLT's position() selects the matching range of nodes from the source XML for each smaller output. Specifically, PHP passes the loop variable into XSLT as a parameter bound to $splitnum (very similar to SQL parameterization).
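
To make the range arithmetic concrete, here is a small sketch of which 1-based positions each value of $splitnum selects, assuming the chunk size of 100 used in this answer. The helper name `chunkRange()` is hypothetical and exists only for this illustration.

```php
<?php
// Sketch: the position() range selected for a given $splitnum.
// chunkRange() is a hypothetical helper, not part of the answer's code.
function chunkRange(int $splitnum, int $size = 100): array
{
    // Lower bound mirrors the XSLT expression $splitnum - 99 when $size is 100
    return [$splitnum - ($size - 1), $splitnum];
}

foreach ([100, 200, 300] as $splitnum) {
    [$from, $to] = chunkRange($splitnum);
    printf("splitnum=%d selects property[%d..%d]\n", $splitnum, $from, $to);
}
// prints:
// splitnum=100 selects property[1..100]
// splitnum=200 selects property[101..200]
// splitnum=300 selects property[201..300]
```

If the source holds fewer nodes than the upper bound, the last range simply matches whatever nodes remain.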

Input (assuming an XML structure like your previous post)

<root>
    <property>
      <rent>
        <term>short</term>
        <freq>week</freq>
        <price_peak>5845</price_peak>
        <price_high>5845</price_high>
        <price_medium>4270</price_medium>
        <price_low>3150</price_low>
      </rent>
    </property>
    <property>
      <rent>
        <term>long</term>
        <freq>week</freq>
        <price_peak>6845</price_peak>
        <price_high>6845</price_high>
        <price_medium>4270</price_medium>
        <price_low>3150</price_low>
      </rent>
    </property>
    ...
</root>

XSLT

(save as .xsl file, a special .xml file; script expects a parameter to be passed in for node range split)

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
   <xsl:output method="xml" omit-xml-declaration="yes" indent="yes" />
   <xsl:strip-space elements="*" />

   <xsl:param name="splitnum" />

   <xsl:template match="/root">
      <xsl:copy>
         <xsl:variable name="currsplit" select="$splitnum - 99"/>
         <xsl:apply-templates select="property[position() &gt;= $currsplit and 
                                               position() &lt;= $splitnum]" />
      </xsl:copy>
   </xsl:template>

   <xsl:template match="property">
      <xsl:copy>
         <xsl:copy-of select="*" />
      </xsl:copy>
   </xsl:template>

</xsl:stylesheet>

PHP

(passes the loop variable into XSLT as a parameter; writes one XML file per block of 100 property nodes, named after the block's upper bound)

// Load XML and XSL
$xml = new DOMDocument;
$xml->load('Input.xml');

$xsl = new DOMDocument;
$xsl->load($xslstr);   // $xslstr: path to the .xsl file saved above

// Configure the transformer once and reuse it for every block
$proc = new XSLTProcessor;
$proc->importStyleSheet($xsl);

$prop_total = $xml->getElementsByTagName('property')->length;

// $i is the upper bound of each block: 100, 200, ... (the last block may be partial)
for ($i = 100; $i < $prop_total + 100; $i += 100) {
    // Bind loop variable to XSLT parameter
    $proc->setParameter('', 'splitnum', $i);

    // Transform XML source (returns the result as a string)
    $newXML = $proc->transformToXML($xml);

    // Output file
    file_put_contents('rentals_'.$i.'.xml', $newXML);
}
Parfait
  • I tried your code and adapted only the paths etc. It splits the file but only runs through 2 property nodes per file. The earlier code you posted worked fine and split the file perfectly but I didn't keep a copy when updating to and testing your latest revision. – d1ch0t0my Nov 09 '17 at 17:28
  • Whoops! Sorry, simply change in XSLT: `$splitnum - 1` to `$splitnum - 99`. See edit. By the way, I must thank you for this question. I never passed a parameter into XSLT with PHP. If not yours, certainly this is going into my library! – Parfait Nov 09 '17 at 18:40
  • Works great! I'm glad we could help each other Parfait. I am going to study XSLT in more depth as it seems quite handy for what I will be doing :). Thanks again. – d1ch0t0my Nov 09 '17 at 18:48

Here is an example of splitting a file using XMLReader. I've tried to make it flexible: the source filename is used as the basis for the names of the files being created, and the split count is defined as a variable.

The main part of the code is a loop that reads the <property> elements; you could tune this as needed. I've also used $rootNodeName as a placeholder for whatever your root node is called.

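$fileName = "data/t1.xml";
$original = new XMLReader;
$original->open($fileName);
$path_parts = pathinfo($fileName);
// Output files are named <dir>/<name>-<record count>.<ext>
$filePrefix = $path_parts['dirname'].'/'.$path_parts['filename'].'-';
$nextRecord = 0;
$splitCount = 2;          // records per output file (use 100 for the question's case)
$rootNodeName = "data";   // set this to your root element's name

$doc = new DOMDocument();
$doc->loadXML("<$rootNodeName/>");
// Skip ahead to the first <property> element
while ($original->read() && $original->name !== 'property');
while ($original->name === 'property')
{
    // Copy the current <property> subtree into the output document
    $newNode = $doc->importNode($original->expand(), true);
    $doc->documentElement->appendChild($newNode);
    $nextRecord++;

    // Output document is full: save it and start a fresh one
    if ( $nextRecord % $splitCount == 0 )   {
        $nextFileName = $filePrefix.$nextRecord.".".$path_parts['extension'];
        $doc->save($nextFileName);
        $doc = new DOMDocument();
        $doc->loadXML("<$rootNodeName/>");
    }
    // Jump to the next <property>, skipping the current subtree
    $original->next('property');
}
// Save any leftover records that did not fill a whole file
if ( $nextRecord % $splitCount != 0 )   {
    $nextFileName = $filePrefix.$nextRecord.".".$path_parts['extension'];
    $doc->save($nextFileName);
}
$original->close();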

It's not the most elegant code, but it could also form the basis of a program that deals with the elements one by one rather than loading the whole document in at one time.
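
As a rough illustration of that one-by-one idea, the sketch below visits each <property> with XMLReader and inspects it without writing any split files. The function name `countRentals()` is hypothetical, and the check for a <rent> element is borrowed from the question's XPath (/root/property[rent]) as an assumed example of per-record work; it looks at any descendant named rent, not only direct children.

```php
<?php
// Sketch: process <property> elements one at a time instead of loading the whole file.
// countRentals() is a hypothetical name chosen for this illustration.
function countRentals(string $fileName): int
{
    $reader = new XMLReader;
    if (!$reader->open($fileName)) {
        throw new RuntimeException("Cannot open $fileName");
    }

    $rentals = 0;
    // Skip ahead to the first <property> element
    while ($reader->read() && $reader->name !== 'property');

    while ($reader->name === 'property') {
        // Import only the current record into a small throwaway document
        $doc = new DOMDocument;
        $node = $doc->importNode($reader->expand(), true);
        $doc->appendChild($node);

        // Per-record work goes here; this example counts records containing <rent>
        if ($node->getElementsByTagName('rent')->length > 0) {
            $rentals++;
        }

        // Move to the next <property>, skipping the current subtree
        $reader->next('property');
    }

    $reader->close();
    return $rentals;
}
```

Only one record is held as a DOM tree at any time, so memory use should stay roughly flat regardless of how many <property> elements the file contains.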

Nigel Ren