My goal is to add XML elements to an existing, fairly large (~90 MB) XML file, while parsing a CSV file to know which elements I should target. The hard part is that both files are big: the CSV alone has 720k lines, so my script takes multiple days to complete, which is not really acceptable.
After some searching, I found a way to "remove" the bottleneck (if there was one) from the CSV parsing by using a generator instead of building a 720k-element array in memory. It is probably better that way, but the whole thing is still far too slow.
Here's a sample of the code:
<?php
$xml = simplexml_load_file('input/xmlfile.xml');
$csv = Utils::parseCSVWithGenerator('input/csvfile.csv', '$');
/* CSV sample ('$'-delimited):
   header: id_1$id_2$id_3$id_4
   line 1: 521$103$490$19
   Only two of the four columns (id_2 and id_3) are actually used.
*/
foreach ($csv as $key => $line) {
    var_dump('key: '.$key);
    // Look up the Child element targeted by this CSV line (null if there is no match)
    $target = $xml->xpath('/Root/Book/Part/Part/Child[@id="'.$line['id_2'].'"]')[0] ?? null;
    if ($target === null) {
        var_dump($line['id_2']);
    } else {
        // If $target exists, $indexChild exists too, and we need to retrieve its Name element
        $indexChild = $xml->xpath('/Root/Child[@id="'.$line['id_3'].'"]')[0];
        $newElement = new SimpleXMLElement('<newElement id="'.$line['id_3'].'"></newElement>');
        $newElement->addChild('Name', (string) $indexChild->Name);
        Utils::simplexml_append($newElement, $target);
    }
}
class Utils {
    // Yields CSV rows one by one as header-keyed arrays, instead of loading all 720k rows into memory
    public static function parseCSVWithGenerator($filename, $delimiter) {
        $fp = fopen($filename, 'r');
        $header = fgetcsv($fp, 0, $delimiter);
        $key = 0;
        while (($data = fgetcsv($fp, 0, $delimiter)) !== FALSE) {
            $key++;
            yield $key => array_combine($header, $data);
        }
        fclose($fp);
    }

    // Appends $child (created in another document) to $parent by importing it into $parent's document
    public static function simplexml_append(SimpleXMLElement $child, SimpleXMLElement $parent) {
        $parent_dom = dom_import_simplexml($parent);
        $child_dom = dom_import_simplexml($child);
        $child_dom = $parent_dom->ownerDocument->importNode($child_dom, TRUE);
        return $parent_dom->appendChild($child_dom);
    }
}
For what it's worth, I tried converting the CSV into an SQLite database, but overall the speed was not significantly different.
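(Roughly, the SQLite variant just replaced the CSV generator with a PDO query over a table I had imported the rows into; the file, table and column names below are simplified placeholders:)

$db = new PDO('sqlite:input/csvfile.sqlite');
foreach ($db->query('SELECT id_2, id_3 FROM csv_rows') as $line) {
    // same two XPath lookups as above, so the per-row cost stayed the same
}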
I'm guessing the heavy part is inside the loop, because for every CSV line we query and modify the DOM of a big XML file that keeps getting bigger.
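To check that guess, I could wrap the XPath call in timers inside the existing foreach, something like this (using the $xml and $line variables from the loop above):

// Rough per-iteration timing, just to see where the time goes
$t0 = microtime(true);
$target = $xml->xpath('/Root/Book/Part/Part/Child[@id="'.$line['id_2'].'"]')[0] ?? null;
var_dump('xpath: '.round((microtime(true) - $t0) * 1000, 2).' ms');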
Any ideas to save execution time? Should I look into multithreading? My computer does have a quad-core processor and only one core is used. Should I change libraries or languages? I'm just throwing out ideas and I'm open to any suggestions, because at the moment I can't really rely on this script to process files this big.
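One idea I'm toying with but have not tested yet is to index the Child elements by their id attribute once, before the loop, so each CSV row becomes an array lookup instead of an XPath query over the whole tree. Roughly (untested sketch, reusing the element names and variables from the script above):

// Untested sketch: build id-indexed lookup tables once, before the loop
$childrenById = [];
foreach ($xml->xpath('/Root/Book/Part/Part/Child') as $child) {
    $childrenById[(string) $child['id']] = $child;
}
$indexById = [];
foreach ($xml->xpath('/Root/Child') as $child) {
    $indexById[(string) $child['id']] = $child;
}

foreach ($csv as $key => $line) {
    $target = $childrenById[$line['id_2']] ?? null;
    if ($target === null) {
        var_dump($line['id_2']);
        continue;
    }
    $indexChild = $indexById[$line['id_3']];
    // addChild() creates the element directly in the main document, so no import is needed
    $newElement = $target->addChild('newElement');
    $newElement->addAttribute('id', $line['id_3']);
    $newElement->addChild('Name', (string) $indexChild->Name);
}

Is that kind of pre-indexing the right direction, or is there a better tool for files this size?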