I'm trying to extract data from many HTML files. To keep it fast I don't use a DOM parser, just simple strpos() calls. Everything goes well when I generate from circa 200,000 files, but if I do it with more files (300,000) it outputs nothing and shows this strange effect:
Look at the bottom diagram (the upper one is the CPU). In the first phase (marked RED) the output file size is growing and everything seems OK. After that (marked ORANGE) the file size becomes zero and the memory usage keeps growing. (Everything appears twice because I restarted the computation halfway through.)
I forgot to mention that I use WAMP.
I have tried unsetting variables, putting the loop into a function, using implode() instead of concatenating strings, using fopen() instead of file_get_contents(), and forcing garbage collection too...
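By unsetting and garbage collection I mean something like this after each file (a sketch, not my exact code):

    unset($c, $a_id, $a_h, $a_d, $a_v); // drop the large buffers for the current file
    gc_collect_cycles();                // force a collection of any leftover cycles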
What is the second phase? Am I out of memory? Is there some limit I don't know about (max_execution_time and memory_limit are already disabled)? Why does this small program use so much memory?
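By disabled I mean something like this at the top of the script (just an illustration):

    ini_set('memory_limit', '-1'); // -1 removes the memory limit
    set_time_limit(0);             // 0 removes the execution time limit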
Here is the code.
$datafile = fopen("meccsek2b.jsb", 'w');

for ($i = 0; $i < 100000; $i++) {
    $a = explode('|', $data[$i]); // $data is loaded and $mid is set in code not shown here
    $file = "data2/$mid.html";
    if (file_exists($file)) {
        $c = file_get_contents($file);
        $o = 0;
        $a_id = array();
        $a_h = array();
        $a_d = array();
        $a_v = array();
        // jump from one '<a href="/test/' anchor to the next and read the values after it
        while ($o = strpos($c, '<a href="/test/', $o)) {
            $o = $o + 15;
            $a_id[] = substr($c, $o, strpos($c, '/', $o) - $o);
            $o = strpos($c, 'val_h="', $o) + 7;
            $a_h[] = substr($c, $o, strpos($c, '"', $o) - $o);
            $o = strpos($c, 'val_d="', $o) + 7;
            $a_d[] = substr($c, $o, strpos($c, '"', $o) - $o);
            $o = strpos($c, 'val_v="', $o) + 7;
            $a_v[] = substr($c, $o, strpos($c, '"', $o) - $o);
        }
        // write one pipe-separated line per input file
        fwrite($datafile,
            $mid.'|'.
            implode(';', $a_id).'|'.
            implode(';', $a_h).'|'.
            implode(';', $a_d).'|'.
            implode(';', $a_v).
            PHP_EOL);
    }
}
fclose($datafile);
Apache error log (expires in 30 days).
I think I found the problem:
There was an infinite loop because strpos() returned 0/false: when one of the inner strpos() calls found no match, the false result plus 7 became a small offset near the start of the file, so the while loop kept re-parsing the same anchors while the arrays kept growing.
The allocated memory kept growing until a fatal error:
PHP Fatal error: Out of memory
Ensino's note about using the command line was very useful; that finally led me to this question.
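For the record, here is the kind of guard that fixes it, assuming the same markup as above (an '<a href="/test/' anchor followed by val_h=", val_d=", val_v=" in that order); it is a sketch, not the exact code I ended up with. Every strpos() result is compared with === false before it is used, and the search offset only moves forward:

    $o = 0;
    while (($o = strpos($c, '<a href="/test/', $o)) !== false) {
        $o += 15;

        // Locate all markers for this anchor before using any of them.
        $slash = strpos($c, '/', $o);
        $h = strpos($c, 'val_h="', $o);
        $d = strpos($c, 'val_d="', $o);
        $v = strpos($c, 'val_v="', $o);

        // If any marker is missing, stop: otherwise false + 7 would reset
        // the offset and the loop would re-parse the same anchor forever.
        if ($slash === false || $h === false || $d === false || $v === false) {
            break;
        }

        $a_id[] = substr($c, $o, $slash - $o);

        $h += 7;
        $a_h[] = substr($c, $h, strpos($c, '"', $h) - $h);

        $d += 7;
        $a_d[] = substr($c, $d, strpos($c, '"', $d) - $d);

        $v += 7;
        $a_v[] = substr($c, $v, strpos($c, '"', $v) - $v);

        // Continue searching after the last value that was read.
        $o = $v;
    }

The !== false check in the while condition also handles the edge case where the anchor sits at offset 0, which the original truthiness test would have treated as "not found".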