
I'm processing a huge XML file (300k lines, about 11 MB) with Simple HTML DOM and running into memory limits. So I added some ini_set calls to override the default php.ini settings and lift the memory and execution-time limits. Bad idea.

My code:

include('simple_html_dom.php');
ini_set('memory_limit', '-1');
ini_set('max_execution_time', '-1');

// Load the whole file into a string, then parse it into a DOM tree
$xml = file_get_contents('HugeFile.xml');
$xml2 = new simple_html_dom();
$xml2->load($xml);

// Replace the text inside every <tag1> element
foreach ($xml2->find('tag1') as $element) {
    $element->innertext = str_replace('text to replace', 'new text', $element->innertext);
}

$xml2->save('output.xml');

Is there a way to make this script run in a reasonable time without hitting memory issues? This is easy to do by hand in a text editor, but I need to automate it because I have plenty of files to edit.

Zakaria
  • Well, either get a bigger package (with higher memory limits) or split the files into smaller pieces. Reading it chunk by chunk will presumably not help as you need the DOM to be ready when iterating over the fields. – Jan Nov 02 '15 at 20:56
  • Isn't there any php function that can free the used memory inside the loop for a given number of iterations? – Zakaria Nov 02 '15 at 21:03
  • If you are running PHP FPM -- some ini_set calls are ignored. – espradley Nov 02 '15 at 21:17
  • Use a Pull Parser like [XMLReader](http://php.net/manual/en/class.xmlreader.php)? – Mark Baker Nov 02 '15 at 21:21
  • If possible, I would recommend to use an external XML parser, you won't need to increase PHP's memory limit. – Capsule Nov 03 '15 at 00:35

1 Answer


I found a better way: there is no need for the DOM here. I just run str_replace on the string returned by file_get_contents, then write the result to another file with file_put_contents. Simple and neat:

$xml = file_get_contents('HugeFile.xml'); 
$new = str_replace('text to replace','new text',$xml);
file_put_contents('output.xml', $new);
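If even one full copy of the file in memory turns out to be too much, the same replacement can probably be streamed line by line. This is only a sketch: `replaceInFile` is a hypothetical helper name, not part of PHP, and it assumes the text to replace never spans a line break.

```php
<?php
// Hypothetical helper: copy $source to $target, replacing $search with
// $replace one line at a time so only a single line is held in memory.
// Assumption: $search never spans a line break.
function replaceInFile(string $source, string $target, string $search, string $replace): int
{
    $in = fopen($source, 'rb');
    if ($in === false) {
        throw new RuntimeException("Cannot open $source for reading");
    }
    $out = fopen($target, 'wb');
    if ($out === false) {
        fclose($in);
        throw new RuntimeException("Cannot open $target for writing");
    }

    $total = 0;
    while (($line = fgets($in)) !== false) {
        // $count receives the number of replacements made in this line
        $line = str_replace($search, $replace, $line, $count);
        $total += $count;
        fwrite($out, $line);
    }

    fclose($in);
    fclose($out);

    return $total; // total number of replacements written
}
```

With the file names from the question, the call would look like `replaceInFile('HugeFile.xml', 'output.xml', 'text to replace', 'new text');`. For an 11 MB file the plain file_get_contents version above is usually fine; streaming mainly pays off when files get much larger.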

preg_replace may come in handy for more complex modifications.
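One such modification is limiting the replacement to the contents of tag1, as the DOM loop in the question did. A rough sketch with preg_replace_callback follows; `replaceInsideTag` is a hypothetical helper name, and it assumes tag1 elements are not nested, are never self-closing, and contain no CDATA section holding a literal closing tag. A regex is not a real XML parser, so treat this as a shortcut for well-behaved input only.

```php
<?php
// Hypothetical helper: apply str_replace only to the text between
// <tag ...> and </tag>, leaving the rest of the document untouched.
// Assumptions: elements of this tag are not nested and not self-closing.
function replaceInsideTag(string $xml, string $tag, string $search, string $replace): string
{
    $t = preg_quote($tag, '~');
    // Group 1: opening tag, group 2: content (lazy, may span lines), group 3: closing tag
    $pattern = '~(<' . $t . '\b[^>]*>)(.*?)(</' . $t . '>)~s';

    $result = preg_replace_callback(
        $pattern,
        function (array $m) use ($search, $replace): string {
            return $m[1] . str_replace($search, $replace, $m[2]) . $m[3];
        },
        $xml
    );

    // preg_replace_callback returns null on failure, e.g. when a PCRE limit is hit
    if ($result === null) {
        throw new RuntimeException('preg_replace_callback failed, error code ' . preg_last_error());
    }

    return $result;
}
```

It would slot into the snippet above as `$new = replaceInsideTag($xml, 'tag1', 'text to replace', 'new text');`. The null check matters on large inputs, where PCRE limits can make the call fail instead of returning a string.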

Zakaria