
I'm trying to extract data from many HTML files. To do it fast I don't use a DOM parser, just simple strpos(). Everything goes well if I generate from circa 200,000 files, but if I do it with more files (300,000) it outputs nothing and produces this strange effect: look at the bottom diagram (the upper one is the CPU). In the first phase (marked RED) the output file size grows and everything seems OK. After that (marked ORANGE) the file size becomes zero and the memory usage grows. (Everything appears twice because I restarted the computation at the halfway point.)

I forgot to mention that I use WAMP.

I have tried unsetting variables, putting the loop into a function, using implode instead of concatenating strings, using fopen instead of file_get_contents, and garbage collection too...

What is the 2nd phase? Am I out of memory? Is there some limit that I don't know about (max_execution_time and memory_limit have already been checked)? Why does this small program use so much memory?

[diagram "processing": CPU usage (top) and output file size / memory usage (bottom) over time]

Here is the code.

$datafile = fopen("meccsek2b.jsb", 'w');
for ($i = 0; $i < 100000; $i++) {
    $a = explode('|', $data[$i]);
    $file = "data2/$mid.html";
    if (file_exists($file)) {
        $c = file_get_contents($file);
        $o = 0;
        $a_id = array();
        $a_h = array();
        $a_d = array();
        $a_v = array();
        while ($o = strpos($c, '<a href="/test/', $o)) {
            $o = $o + 15;
            $a_id[] = substr($c, $o, strpos($c, '/', $o) - $o);
            $o = strpos($c, 'val_h="', $o) + 7;
            $a_h[] = substr($c, $o, strpos($c, '"', $o) - $o);
            $o = strpos($c, 'val_d="', $o) + 7;
            $a_d[] = substr($c, $o, strpos($c, '"', $o) - $o);
            $o = strpos($c, 'val_v="', $o) + 7;
            $a_v[] = substr($c, $o, strpos($c, '"', $o) - $o);
        }
        fwrite($datafile,
            $mid.'|'.
            implode(';', $a_id).'|'.
            implode(';', $a_h).'|'.
            implode(';', $a_d).'|'.
            implode(';', $a_v).
            PHP_EOL);
    }
}
fclose($datafile);
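
To diagnose the memory question above, here is a hedged sketch (not from the original post) that logs PHP's memory usage inside the loop via memory_get_usage() and memory_get_peak_usage():

    // Inside the for loop: log memory every 10,000 iterations
    if ($i % 10000 === 0) {
        error_log(sprintf('i=%d mem=%.1f MB peak=%.1f MB',
            $i,
            memory_get_usage(true) / 1048576,
            memory_get_peak_usage(true) / 1048576));
    }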

Apache error log. (expires in 30 days)

I think I found the problem:

There was an infinite loop because strpos() returned 0. The allocated memory kept growing until a fatal error:

PHP Fatal error:  Out of memory 

Enzino's note about using the command line was very useful; it finally led me to this question.
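
For illustration (a hedged sketch, not from the original post): when an inner marker is missing, strpos() returns FALSE, and arithmetic like FALSE + 7 silently coerces it to 0, rewinding the offset so the outer loop keeps re-finding the same matches:

    // Hypothetical haystack in which the 'val_h="' marker never occurs.
    $c = 'no marker here';
    $o = strpos($c, 'val_h="'); // FALSE: needle not found
    var_dump($o);               // bool(false)
    var_dump($o + 7);           // int(7): FALSE coerced to 0, offset rewinds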

  • If it is a memory issue, I think it would generate a fatal error in your error logs, right? Also, have you tried opening/closing your $datafile on iterations? Maybe you might try changing up using file_get_contents with opening it as a socket as well? – Aaron Saray Aug 22 '13 at 20:33
  • Can you show the code that you're using to open and close the output file? Did you try to run the script from the command line? –  Aug 22 '13 at 20:37
  • @AaronSaray The PHP error log is empty. I tried opening/closing it on iterations, but that didn't change anything. How do I open it as a socket? –  Aug 23 '13 at 11:05
  • I have inserted the Apache error log. –  Aug 23 '13 at 11:07
  • @Enzino Code updated. I always ran the script from the browser. –  Aug 23 '13 at 11:09
  • Have you tried unsetting the local variables? I mean $c, $a, $a_id, $a_h, $a_d, $a_v. They are reinitialized on the next iteration, but it's worth a try. – Pep Lainez Aug 23 '13 at 17:57
  • @2astalavista I mean using the command fopen() instead of file_get_contents() - perhaps the memory management will differ... – Aaron Saray Aug 23 '13 at 19:16
  • @Enzino the bounty is yours, just make an answer –  Aug 25 '13 at 11:49

3 Answers


The CPU spike most likely means that PHP is doing garbage collection. If you want to gain some performance at the cost of higher memory usage, you can disable garbage collection with gc_disable().

Looking at the code, I'd guess that you've reached the point where file_get_contents() is reading some big file, and PHP realizes it has to free memory by running garbage collection in order to store its content.

The best approach is to read the file continuously and process it in chunks rather than holding the whole thing in memory.
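
A minimal sketch of that approach (the filename and chunk size here are placeholders, not from the answer):

    $fh = fopen('data2/example.html', 'r');
    if ($fh !== false) {
        while (!feof($fh)) {
            $chunk = fread($fh, 8192); // read 8 KB at a time
            // ... scan $chunk here; keep a small tail from the previous
            // chunk so markers split across a boundary are not lost ...
        }
        fclose($fh);
    }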

Michal Čihař
  • the files loaded using file_get_contents are small (~20 KB), but there are many of them –  Aug 23 '13 at 15:53

A huge amount of data is going into the system's internal file cache. When that cached data is flushed to disk, it can have an impact on memory and performance.

There is a Windows system function, FlushFileBuffers, to enforce the writes: see http://msdn.microsoft.com/en-us/library/windows/desktop/aa364451%28v=vs.85%29.aspx and http://winbinder.org/ for a way to call it from PHP.

(Though this does not explain the empty file, unless there is a Windows bug.)
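
As a cheaper PHP-level sketch (an assumption, not from the answer): fflush() hands PHP's buffered writes to the operating system, though forcing the OS cache itself to disk would still require FlushFileBuffers via an extension such as WinBinder.

    $datafile = fopen('meccsek2b.jsb', 'w');
    for ($i = 0; $i < 100000; $i++) {
        fwrite($datafile, "row $i" . PHP_EOL); // placeholder row
        if ($i % 1000 === 0) {
            fflush($datafile); // push PHP's write buffer to the OS periodically
        }
    }
    fclose($datafile);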


You should consider running your script from the command line; this way you might catch the error without digging through the error logs.
Furthermore, as stated in the PHP manual, strpos() may return boolean FALSE, but it may also return a non-boolean value which evaluates to FALSE, so the correct way to test its return value is with the !== operator:

while (($o = strpos($c, '<a href="/test/', $o)) !== FALSE) {
    ...
}
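
The same caveat applies to the inner strpos() calls in the loop body: a failed search returns FALSE, and FALSE + 7 rewinds the offset. A hedged sketch of guarding one of them:

    $p = strpos($c, 'val_h="', $o);
    if ($p === FALSE) {
        break; // marker missing: bail out instead of rewinding the offset
    }
    $o = $p + 7;
    $a_h[] = substr($c, $o, strpos($c, '"', $o) - $o);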