
My goal is to add XML elements to an existing XML file, which is quite big (~90MB), while parsing a CSV file to know which elements to target. The hard part is that both the XML and the CSV files are big: the CSV has 720k lines, so my script takes multiple days to complete, which is not really acceptable.

After some searching, I found a way to remove a potential bottleneck in the CSV parsing by using a generator instead of building a 720k-element array in memory. That is probably an improvement, but overall the script is still far too slow.

Here's a sample of the code:

<?php 
$xml = simplexml_load_file('input/xmlfile.xml');
$csv = Utils::parseCSVWithGenerator('input/csvfile.csv', '$');
/* CSV SAMPLE
Header: id_1$id_2$id_3$id_4
Line 1: 521$103$490$19
Only the id_2 and id_3 columns are actually used below
*/

foreach ($csv as $key => $line) {
    var_dump('key: '.$key); // progress output

    // Look up the element to extend; xpath() returns an array of matches
    $matches = $xml->xpath('/Root/Book/Part/Part/Child[@id="'.$line['id_2'].'"]');
    if (empty($matches)) {
        var_dump($line['id_2']); // no matching Child: log the id and move on
    } else {
        $target = $matches[0];
        // If $target exists, $indexChild exists too, and we need to retrieve the Name element from it
        $indexChild = $xml->xpath('/Root/Child[@id="'.$line['id_3'].'"]')[0];
        $newElement = new SimpleXMLElement('<newElement id="'.$line['id_3'].'"/>');
        $newElement->addChild('Name', (string) $indexChild->Name);
        Utils::simplexml_append($newElement, $target);
    }
}

class Utils {
    public static function parseCSVWithGenerator($filename, $delimiter) {
        $fp = fopen($filename, 'r');
        $header = fgetcsv($fp, 0, $delimiter);
        $key = 0;

        // Yield one row at a time instead of building a 720k-element array
        while (($data = fgetcsv($fp, 0, $delimiter)) !== false) {
            $key++;
            yield $key => array_combine($header, $data);
        }
        fclose($fp);
    }

    public static function simplexml_append(SimpleXMLElement $child, SimpleXMLElement $parent) {
        $parent_dom = dom_import_simplexml($parent);
        $child_dom = dom_import_simplexml($child);

        $child_dom = $parent_dom->ownerDocument->importNode($child_dom, TRUE);

        return $parent_dom->appendChild($child_dom);
    }
}

For what it's worth, I tried converting the CSV into an SQLite database, but overall the speed was not significantly different.

I'm guessing the heavy part is inside the loop, because we query and modify the DOM of a big XML file that keeps getting bigger.
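
Here is roughly how the loop could be instrumented to confirm that guess (a sketch only: it assumes the `Utils` class above is loaded, and the 1,000-row cap is arbitrary):

<?php
// Rough timing probe: how much of each iteration goes to the XPath lookups?
$xml = simplexml_load_file('input/xmlfile.xml');

$xpathTime = 0.0;
$rows = 0;

foreach (Utils::parseCSVWithGenerator('input/csvfile.csv', '$') as $line) {
    $t = microtime(true);
    $xml->xpath('/Root/Book/Part/Part/Child[@id="'.$line['id_2'].'"]');
    $xpathTime += microtime(true) - $t;
    // ... the append step would be wrapped the same way ...
    if (++$rows === 1000) {
        break;
    }
}
printf("xpath: %.3fs over %d rows\n", $xpathTime, $rows);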

Any ideas to save execution time? Should I look into multithreading? My computer has a quad-core processor and only one core is used. Should I change libraries or languages? I'm just throwing out ideas and I'm open to any suggestions, because at the moment I can't rely on this script to process files this big.

Kevin B.
  • Instead of using `simplexml_load_file`, you can use [DomDocument](http://php.net/domdocument) or [XmlParser](http://php.net/manual/en/book.xml.php), which is actually meant for large XML files *(please note that you won't be able to add anything to the XML with XmlParser)*. To parse the CSV, you can always use MySQL and [LOAD DATA](http://dev.mysql.com/doc/refman/5.0/en/load-data.html): `LOAD DATA INFILE "input/csvfile.csv" INTO TABLE CSVImport COLUMNS TERMINATED BY '$' OPTIONALLY ENCLOSED BY '"' ESCAPED BY '"' LINES TERMINATED BY '\n' IGNORE 1 LINES;` – Cemal Feb 13 '18 at 12:47
    @Cemal PHP's [DOM and SimpleXML aren't actually two different parsers](https://stackoverflow.com/a/4817550/157957). They have exactly the same in-memory implementation, and can be used interchangeably - indeed, this code already does just that, in the `simplexml_append` method. Neither is "actually for large XML files". An event based XML Parser, or the pull-based XMLReader class might help, but it's not obvious what the algorithm would be in this use case. – IMSoP Feb 13 '18 at 12:49
  • @IMSoP XmlParser is for large xml files. Sorry if my sentence created ambiguity. But XmlParser doesn't provide any methods to manipulate the existing xml. – Cemal Feb 13 '18 at 12:56
  • If your XML and CSV are in key order, you could read them in parallel, use your current file reader and XMLReader and XMLWriter to output the updated/new content. – Nigel Ren Feb 13 '18 at 20:04

1 Answer


For every line in the CSV file you are constructing and evaluating an XPath expression of the form

/Root/Child[@id="..."]

The first obvious inefficiency is that you really don't want to compile a new XPath expression each time; you should use the same compiled expression with a parameter. (I don't know the PHP APIs in detail, I'm just looking at the general principles.)

But even then, this expression is likely to take time proportional to the size of the XML document. You need an index of some kind.
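
In PHP terms, the indexing idea might look roughly like this (a sketch only: it reuses the paths and the `Utils` helpers from the question, and assumes PHP 7+ for the `??` operator):

<?php
// Sketch: scan the document once per element kind and index by @id,
// so each CSV row costs O(1) instead of a full-document XPath search.
$xml = simplexml_load_file('input/xmlfile.xml');

$targetsById = [];                                  // /Root/Book/Part/Part/Child
foreach ($xml->xpath('/Root/Book/Part/Part/Child') as $el) {
    $targetsById[(string) $el['id']] = $el;
}

$namesById = [];                                    // /Root/Child
foreach ($xml->xpath('/Root/Child') as $el) {
    $namesById[(string) $el['id']] = $el;
}

foreach (Utils::parseCSVWithGenerator('input/csvfile.csv', '$') as $line) {
    $target     = $targetsById[$line['id_2']] ?? null;
    $indexChild = $namesById[$line['id_3']] ?? null;
    if ($target === null || $indexChild === null) {
        continue;                                   // unmatched row
    }
    $newElement = new SimpleXMLElement('<newElement id="'.$line['id_3'].'"/>');
    $newElement->addChild('Name', (string) $indexChild->Name);
    Utils::simplexml_append($newElement, $target);  // as in the question
}

The two index-building scans are each a single pass over the document, so the total cost drops from roughly 720,000 document scans to two scans plus 720,000 constant-time array lookups.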

I'll tell you how I would do this. You might not like the solution, but perhaps it will give you ideas.

I would write it in XSLT 3.0 (available to PHP users via the Saxon/C product). I would write the transformation so that it first indexes the entries in the CSV file as a map, and then processes all the records in the XML file, checking each one to see if there is a corresponding entry in the CSV input. Something like this:

<xsl:param name="csvFileName" as="xs:string"/>
<xsl:variable name="csvMap" as="map(*)">
  <xsl:map>
    <xsl:for-each select="unparsed-text-lines($csvFileName)">
      <xsl:variable name="fields" select="tokenize(., '\$')"/> <!-- the CSV delimiter is '$' -->
      <xsl:map-entry key="$fields[1]" select="$fields"/>
    </xsl:for-each>
  </xsl:map>
</xsl:variable>

<xsl:mode on-no-match="shallow-copy"/>
<xsl:template match="/Root/Child[map:contains($csvMap, @id)]">
  <xsl:variable name="csvRecord" select="$csvMap(@id)"/>
  <xsl:copy>
    <newElement id="{@id}"
      x="{$csvRecord[2]}" y="{$csvRecord[3]}"/>
  </xsl:copy>
</xsl:template>  

Of course this is only the gist: without seeing the detailed structure of your input files or your desired output, it's the best I can do.

If you prefer to use the XSLT 1.0 processor that comes with PHP, that can probably also be done, but it will be rather more verbose: you'll have to convert the CSV file to XML in the calling application, and you can then use XSLT keys, in place of a map, to get efficient access to it.
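
The PHP side of that conversion might look roughly like this (a sketch: the `rows`/`row` element names and the `input/csvfile.xml` output path are invented for illustration):

<?php
// Sketch: convert the CSV to a simple XML file that an XSLT 1.0 stylesheet
// can load with document('csvfile.xml') and index with <xsl:key>.
$writer = new XMLWriter();
$writer->openURI('input/csvfile.xml');
$writer->startDocument('1.0', 'UTF-8');
$writer->startElement('rows');

foreach (Utils::parseCSVWithGenerator('input/csvfile.csv', '$') as $line) {
    $writer->startElement('row');
    foreach ($line as $field => $value) {
        $writer->writeAttribute($field, $value);  // id_1, id_2, id_3, id_4
    }
    $writer->endElement();
}

$writer->endElement();
$writer->endDocument();
$writer->flush();

On the stylesheet side, a declaration such as `<xsl:key name="csv" match="row" use="@id_2"/>` over the generated file then provides lookups comparable to the XSLT 3.0 map.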

Note that doing the join this way around is probably better because the CSV file is the smaller of the two, so the map structure is smaller that way; and with XSLT 3.0 the processing of the XML file can be made fully streamed, so it scales beyond your current size of 90MB without difficulty (streaming starts to become essential at around the 200MB mark, depending on how much memory you want to allocate).

Michael Kay
    A variant of this with fewer dependencies and new languages to learn would be to build the CSV map in PHP, or as a random-access file, database table, etc, and then use an [event-based](http://php.net/manual/en/function.xml-parser-create.php) or [pull-based](http://php.net/manual/en/book.xmlreader.php) XML parser to stream through the XML file looking up the appropriate CSV values for each node. Essentially "emulating" what the XSLT processor would do with this solution (a rough sketch of that approach follows these comments). – IMSoP Feb 13 '18 at 12:54
  • True, but anyone doing anything this complex with XML is wasting their employer's time if they don't master XSLT at some stage. – Michael Kay Feb 13 '18 at 14:42
  • Sure, I upvoted the answer before I commented, because this does seem to be an elegant solution to the problem. But we don't know without a lot more background whether the learning curve of doing it in XSLT 1.0, or the deployment requirements for using XSLT 3.0, would pay off for this specific situation, so I thought I'd offer some thoughts on how to do it without. – IMSoP Feb 13 '18 at 14:52
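
A rough sketch of the streaming approach described in IMSoP's first comment (it reuses the question's element names and `Utils` generator; namespaces, CDATA handling, and the second Name lookup are omitted for brevity):

<?php
// Sketch: build the CSV lookup first, then copy the XML through
// XMLReader/XMLWriter, rewriting only the matched elements.
$csvById = [];
foreach (Utils::parseCSVWithGenerator('input/csvfile.csv', '$') as $line) {
    $csvById[$line['id_2']] = $line['id_3'];   // keep only the column we need
}

$reader = new XMLReader();
$reader->open('input/xmlfile.xml');
$writer = new XMLWriter();
$writer->openURI('output/xmlfile.xml');
$writer->startDocument('1.0', 'UTF-8');
$doc = new DOMDocument();                      // scratch document for matches

$keepReading = $reader->read();
while ($keepReading) {
    $id = $reader->nodeType === XMLReader::ELEMENT ? $reader->getAttribute('id') : null;
    if ($reader->name === 'Child' && $id !== null && isset($csvById[$id])) {
        // Materialise just this one element, modify it, serialise it back out.
        $node = $doc->importNode($reader->expand(), true);
        $new  = $node->appendChild($doc->createElement('newElement'));
        $new->setAttribute('id', $csvById[$id]);
        $writer->writeRaw($doc->saveXML($node));
        $keepReading = $reader->next();        // next() skips the copied subtree
        continue;
    }
    if ($reader->nodeType === XMLReader::ELEMENT) {
        $isEmpty = $reader->isEmptyElement;
        $writer->startElement($reader->name);
        while ($reader->moveToNextAttribute()) {
            $writer->writeAttribute($reader->name, $reader->value);
        }
        $reader->moveToElement();
        if ($isEmpty) {
            $writer->endElement();
        }
    } elseif ($reader->nodeType === XMLReader::END_ELEMENT) {
        $writer->endElement();
    } elseif ($reader->nodeType === XMLReader::TEXT
            || $reader->nodeType === XMLReader::SIGNIFICANT_WHITESPACE) {
        $writer->text($reader->value);
    }
    $keepReading = $reader->read();
}
$writer->endDocument();

Because the reader and writer each hold only one node at a time (plus the one expanded match), memory use stays flat no matter how large the XML file grows.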