
I have a script that, when put against a timer, gets progressively slower. It's fairly simple as all it does is reads a line, checks it then adds it to the database, then proceeds to the next line.

Here's the output of it gradually getting worse:

Record: #1,001 Memory: 1,355,360kb taking 1.84s
Record: #2,002 Memory: 1,355,192kb taking 2.12s
Record: #3,003 Memory: 1,355,192kb taking 2.39s
Record: #4,004 Memory: 1,355,192kb taking 2.65s
Record: #5,005 Memory: 1,355,200kb taking 2.94s
Record: #6,006 Memory: 1,355,376kb taking 3.28s
Record: #7,007 Memory: 1,355,176kb taking 3.56s
Record: #8,008 Memory: 1,355,408kb taking 3.81s
Record: #9,009 Memory: 1,355,464kb taking 4.07s
Record: #10,010 Memory: 1,355,392kb taking 4.32s
Record: #11,011 Memory: 1,355,352kb taking 4.63s
Record: #12,012 Memory: 1,355,376kb taking 4.90s
Record: #13,013 Memory: 1,355,200kb taking 5.14s
Record: #14,014 Memory: 1,355,184kb taking 5.43s
Record: #15,015 Memory: 1,355,344kb taking 5.72s

The file, unfortunately, is around 20 GB, so at this rate of increase I'll probably be dead by the time the whole thing is read. The code is (mainly) below, and I suspect it has something to do with fgets(), but I'm not sure what.

    $handle = fopen($import_file, 'r');

    // Compare against false explicitly so a final line of "0" doesn't end the loop early.
    while (($line = fgets($handle)) !== false)
    {
        $data = json_decode($line);

        save_record($data, $line);
    }

    fclose($handle);

Thanks in advance!

EDIT:

Commenting out save_record($data, $line); makes no difference; the slowdown is still there.

DCD
  • Can you post the code for save_record? That is probably the key – Jhong Aug 15 '10 at 10:08
  • Actually if I comment out the save_record () line it's still just as bad. – DCD Aug 15 '10 at 10:13
  • How are you getting that performance output? You have no performance logging in the code sample you provided. I suspect the problem is elsewhere. Do you have some more code that you're not showing us that might be relevant? – Mark Byers Aug 15 '10 at 10:21
  • Yeah, we need to see more code. And you are 100% sure those seconds are not simply the overall time progressing? Just to exclude the possibility... – Pekka Aug 15 '10 at 10:36
  • Yep, 99.9% sure, the total time adds up to 60s, which is the PHP timeout point. – DCD Aug 15 '10 at 10:38
  • @DCD can you show the full code? – Pekka Aug 15 '10 at 10:43
  • Be sure to free up memory where you can. I had a similar problem: with big files the script became slower, and after a while it failed with a "memory full" error. – Radu Maris Aug 16 '10 at 20:36
  • @DCD: where and how are you calculating the elapsed time? How are you calling your code? – Yanick Rochon Aug 17 '10 at 11:52

4 Answers


Sometimes it is better to use system commands for reading these large files. I ran into something similar and here is a little trick I used:

// Count the lines; reading via stdin keeps the filename out of wc's output.
$lines = (int) exec('wc -l < ' . escapeshellarg($filename));

for ($i = 1; $i <= $lines; $i++) {
    // sed '<n>!d' deletes every line except line n, so only that record is printed.
    $line = exec('sed ' . escapeshellarg($i . '!d') . ' ' . escapeshellarg($filename));

    // do what you want with the record here
}

I would not recommend this with files that cannot be trusted, but it runs fast since it pulls one record at a time using the system. Hope this helps.

Chuck Burgess

http://php.net/manual/en/function.fgets.php

According to Leigh Purdie's comment on that page, there are performance issues with fgets() on big files. If your JSON objects are bigger than his test lines, you might hit the limits much faster.

Use stream_get_line() (http://php.net/manual/en/function.stream-get-line.php) and specify a length limit.
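
As a minimal sketch of what that could look like in the question's loop (the 65536-byte cap is an arbitrary choice of mine, not something from the original post):

    $handle = fopen($import_file, 'r');

    // Read at most 65536 bytes per line, stopping at the newline delimiter;
    // an explicit cap keeps one oversized JSON object from blowing up the buffer.
    while (($line = stream_get_line($handle, 65536, "\n")) !== false)
    {
        $data = json_decode($line);

        save_record($data, $line);
    }

    fclose($handle);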

Johan Buret
  • 2,614
  • 24
  • 32

Alright, a performance problem. Obviously something is going quadratic when it shouldn't, or more to the point, something that should be constant-time seems to be linear in the number of records dealt with so far. The first question is what's the minimal scrap of code that exhibits the problem. I would want to know if you get the same problematic behavior when you comment out all but reading the file line by line. If so, then you'll need a language without that problem. (There are plenty.) Anyway, once you see the expected time characteristic, add statements back in one-by-one until your timing goes haywire, and you'll have identified the problem.

You instrumented something or other to get the timings. Make sure those can't cause a problem by executing them alone 15000 times or so.
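
As a rough sketch of that experiment, assuming a logging interval of 1,000 records (the interval and variable names here are mine, not the original poster's), the stripped-down loop might look like:

    $handle = fopen($import_file, 'r');
    $count  = 0;
    $start  = microtime(true);

    // Nothing here but the read loop and the timing itself; if this alone
    // slows down over time, json_decode() and save_record() are off the hook.
    while (($line = fgets($handle)) !== false)
    {
        $count++;

        if ($count % 1000 == 0)
        {
            printf("Record: #%s Memory: %skb taking %.2fs\n",
                number_format($count),
                number_format(memory_get_usage() / 1024),
                microtime(true) - $start);

            $start = microtime(true);   // per-batch time, not cumulative
        }
    }

    fclose($handle);

If the per-batch time stays flat here, reintroduce json_decode() and then save_record() one at a time and see which one bends the curve.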

Ian

I found this question while trying to find a way to get through a 96 GB text file more quickly. The script I initially wrote took 15 hours to reach 0.1%...

I tried some of the solutions suggested here, using stream_get_line, fgets and exec with sed. I ended up with a different approach that I thought I would share with anyone else stopping by this question.

Split the file up! :-)

On my FreeBSD box I have a command-line utility named split (it also exists on Linux and others).

usage: split [-l line_count] [-a suffix_length] [file [prefix]]
       split -b byte_count[K|k|M|m|G|g] [-a suffix_length] [file [prefix]]
       split -n chunk_count [-a suffix_length] [file [prefix]]
       split -p pattern [-a suffix_length] [file [prefix]]

So I ran :

split -l 25000 -a 3 /data/var/myfile.log /data/var/myfile-log/

I ended up with 5608 files in the /data/var/myfile-log/ directory, which could then all be processed one at a time with a command like:

php -f do-some-work.php /data/var/myfile-log/*
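
The do-some-work.php script itself is not shown in the answer; as a minimal sketch, assuming it simply receives the chunk files as arguments and reuses the question's save_record() helper, it might look like:

    <?php
    // Hypothetical do-some-work.php: process every chunk file passed on the
    // command line. Each chunk is only 25,000 lines, so a plain fgets() loop
    // is comfortable here.
    foreach (array_slice($argv, 1) as $chunk)
    {
        $handle = fopen($chunk, 'r');

        while (($line = fgets($handle)) !== false)
        {
            $data = json_decode($line);
            save_record($data, $line);   // same helper as in the question
        }

        fclose($handle);
    }
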
CodeReaper