
I was wondering if anyone out there knew how this could be done in PHP. I am running a script that involves opening a file, taking the first 1000 lines, doing some stuff with those lines, and then having the PHP file open another instance of itself to take the next thousand lines, and so on until it reaches the end of the file. I'm using SplFileObject so that I can seek to a certain line, which allows me to break this up into 1000-line chunks quite well. The biggest problem I'm having is with performance. I'm dealing with files that have upwards of 10,000,000 lines, and while it does the first 10,000 lines or so quite fast, there is a huge exponential slowdown after that point, which I think is just from having to seek to that point.
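For reference, this is roughly what that chunked approach looks like (a minimal sketch with hypothetical file and argument names, not my actual script):

<?php
// Minimal sketch of the chunked approach described above.
// $startLine would be passed in by the previous instance of the script.
$startLine = isset($argv[1]) ? (int) $argv[1] : 0;
$chunkSize = 1000;

$file = new SplFileObject('huge-file.txt'); // hypothetical filename
$file->seek($startLine);                    // seek to the first line of this chunk

for ($i = 0; $i < $chunkSize && !$file->eof(); $i++) {
    $line = $file->current();
    $file->next();
    // ... do some stuff with $line ...
}

// Then launch another instance of this file for the next thousand lines, e.g.:
// exec('php ' . escapeshellarg(__FILE__) . ' ' . ($startLine + $chunkSize));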

What I would like to do is read the first thousand lines, then just delete them from the file so that my script is always reading the first thousand lines. Is there a way to do this without reading the rest of the file into memory? Other solutions I have seen involve reading each line into an array and then getting rid of the first X entries, but with ten million lines that would eat up too much memory and time.

If anyone has a solution or other suggestions that would speed up the performance, it would be greatly appreciated.

Eric Strom
  • You *think* the time is taken seeking? – salathe Mar 26 '12 at 18:24
  • I commented out the line that iterates the line counter so that it always ran the first 1000, and it ran exponentially faster. Plus, this gets exponentially slower as it goes along; the only thing that's changing is the line that it's seeking to. – Eric Strom Mar 26 '12 at 18:27
  • Seeking shouldn't be taking *exponentially* more time. On what sort of scale is the slowdown? – salathe Mar 26 '12 at 18:30
  • It might be worthwhile [`split`](http://linux.die.net/man/1/split)-ing your file into several *n* thousand line files, or is there some reason it must be one big file? – salathe Mar 26 '12 at 18:34
  • It might also be of interest to know that when using `SplFileObject`'s `seek()` method, the file is still being *read* all the way up to where you're seeking to (each line is read then thrown away). It is *not* the same as `fseek()`-ing to a byte offset. – salathe Mar 26 '12 at 18:36
  • The data that I'm getting from the file is used to create entries in a MySQL database, so I'm monitoring the performance by number of records. The first thousand records get inserted in less than a second. The second thousand takes about five seconds, the next thousand about a minute. Once I get up to around 15,000 records, it takes about 10 minutes per thousand. Again, when I commented out the iteration, the SQL records were inserted at the same speed as the first thousand continuously, so it's not a problem with the size of the database. – Eric Strom Mar 26 '12 at 18:37
  • In that case, I doubt `SplFileObject::seek()` is the culprit. It should be taking in the order of second(s) at most to read 10,000,000+ lines. – salathe Mar 26 '12 at 18:40
  • My only advice here is to break down the script to find the real point that is causing the slow down. It might be `SplFileObject`'s fault (especially on Windows), but without you being able to *show* that it is the cause I would remain skeptical. – salathe Mar 26 '12 at 18:43
  • The reason that I'm using SplFileObject is because you can seek by line instead of bytes. I imagine, though, that that is what's causing the slowdown, because it has to seek to line 1,000,000 or whatever and is reading everything up to that line. – Eric Strom Mar 26 '12 at 18:44
  • Why not make a script that *only* seeks over the file and see if that is too slow for you? – salathe Mar 26 '12 at 18:44
  • I just did and you're exactly right. Seeking to where I tested was almost instantaneous when there was nothing else going on, so it must be somewhere else in the script. Thank you for all your help. – Eric Strom Mar 26 '12 at 18:51
  • @Eric don't seek by lines. You'll have to count lines EVERY TIME you open the file. Store the byte offset returned by `ftell()` or whatever it is in `SplFileObject`. That's a simple count of bytes to skip over, and will be very fast since PHP doesn't have to scan/count line endings. Once you've seeked to the proper location, THEN you can start counting lines (see the sketch after these comments). – Marc B Mar 26 '12 at 19:30
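To illustrate the byte-offset idea from that last comment, here is a rough sketch (plain fopen()/fseek()/ftell() shown here; the filename and argument handling are made up):

<?php
// Sketch of resuming by byte offset instead of by line number.
// $offset would be saved by the previous run (the ftell() value after its last chunk).
$offset = isset($argv[1]) ? (int) $argv[1] : 0;

$fp = fopen('huge-file.txt', 'r'); // hypothetical filename
fseek($fp, $offset);               // jump straight to the saved position; no lines are counted

$count = 0;
while ($count < 1000 && ($line = fgets($fp)) !== false) {
    // ... do some stuff with $line ...
    $count++;
}

$nextOffset = ftell($fp);          // hand this to the next invocation
fclose($fp);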

2 Answers


Unfortunately, there is no real solution to this, because the files are always loaded fully into main memory before they are read.

Still, I have posted this answer because it is a possible solution, but I suspect it hardly improves the performance. Correct me if I am wrong.

You can use XML to divide the file into units of 1000 lines, and use PHP's DOMDocument class to retrieve and append data. You can append a child when you want to add data, retrieve the first child to get the first thousand lines, and delete that node if you want. Just like this:

<document>
    <part>
        . . . 
        Thousand lines here
        . . . 
    </part>
    <part>
        . . . 
        Thousand lines here
        . . . 
    </part>
    <part>
        . . . 
        Thousand lines here
        . . . 
    </part>
    .
    .
    .
</document>
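A rough sketch of reading and pruning the first block with DOMDocument (assuming a container file named parts.xml laid out as above; the names are illustrative only):

<?php
// Load the XML container and take the first <part> element.
$doc = new DOMDocument();
$doc->load('parts.xml'); // hypothetical filename

$parts = $doc->getElementsByTagName('part');
if ($parts->length > 0) {
    $first = $parts->item(0);
    $thousandLines = $first->textContent; // the first thousand lines as one string

    // ... do some stuff with $thousandLines ...

    // Remove the consumed block and write the document back out.
    $first->parentNode->removeChild($first);
    $doc->save('parts.xml');
}

Note that DOMDocument still parses the whole file into memory, which is why I doubt this improves the performance much.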

ANOTHER WAY:

If you are really sure about breaking the sections into exactly 1000 lines, why don't you save them in a database, with each 1000-line chunk in a different row? By doing this you will surely reduce file read/write overhead and improve performance.
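For illustration, a minimal sketch of that idea using PDO (the table, column and connection details are made up):

<?php
// One-time loading phase: read the big file once and store 1000-line chunks as rows.
$pdo    = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass'); // hypothetical connection
$insert = $pdo->prepare('INSERT INTO chunks (body) VALUES (?)');

$file   = new SplFileObject('huge-file.txt'); // hypothetical filename
$buffer = array();
foreach ($file as $line) {
    $buffer[] = $line;
    if (count($buffer) === 1000) {
        $insert->execute(array(implode('', $buffer)));
        $buffer = array();
    }
}
if ($buffer) {
    $insert->execute(array(implode('', $buffer)));
}

// Processing phase: each run takes the oldest chunk and deletes it when done.
$row = $pdo->query('SELECT id, body FROM chunks ORDER BY id LIMIT 1')->fetch();
if ($row) {
    // ... do some stuff with explode("\n", $row['body']) ...
    $pdo->prepare('DELETE FROM chunks WHERE id = ?')->execute(array($row['id']));
}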

Tabrez Ahmed

It seems to me that the objective is to parse a huge amount of data and insert it into a database? If so, I fail to understand why it's important to work with exactly 1000 lines?

I think I would just approach it by reading a big chunk of data, say 1 MB, into memory at once, and then scan backwards from the end of the in-memory chunk for the last line ending. Once I have that, I can save the file position and the extra data I have (what's left over from the last line ending until the end of the chunk). Alternatively, just reset the file pointer using `fseek()` to where in the file I found the last line ending, easily accomplished with `strlen($chunk)`.

That way, all I have to do is explode the chunk by running `explode("\r\n", $chunk)` and I have all the lines I need, in a suitably big block for further processing.
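A rough sketch of that, carrying the leftover partial line forward instead of fseek()-ing back (the buffer size and the "\r\n" line ending are assumptions):

<?php
// Read ~1 MB at a time and only hand complete lines on for processing.
$fp       = fopen('huge-file.txt', 'r'); // hypothetical filename
$leftover = '';

while (!feof($fp)) {
    $chunk = $leftover . fread($fp, 1024 * 1024);

    // Scan backwards for the last line ending; anything after it is an incomplete line.
    $pos = strrpos($chunk, "\r\n");
    if ($pos === false) {
        $leftover = $chunk; // no complete line yet, keep reading
        continue;
    }

    $leftover = substr($chunk, $pos + 2);             // partial line carried into the next pass
    $lines    = explode("\r\n", substr($chunk, 0, $pos));

    foreach ($lines as $line) {
        // ... process $line ...
    }
}

if ($leftover !== '') {
    // ... process the final, unterminated line ...
}
fclose($fp);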

Deleting lines from the beginning of the file is not recommended. That's going to shuffle a huge amount of data back and forth to disk.

mgefvert