I'm trying to iterate over a very large, constantly growing file (typically around 1.5 million lines) and perform operations on each line. It's a log file, so new lines are appended at the end. My program lets users specify parameters that each line must match, and it returns the most recent matches. I'd therefore like to start at the end of the file and work backwards toward the beginning, rather than building a list of all the lines and reversing it, so the program stays efficient.
Here is an example situation:
2016-01-01 01:00 apple
2016-01-02 05:00 banana
2016-01-03 03:00 apple
2016-01-04 00:00 apple
2016-01-05 12:00 banana
If a user requested 1 line that matched "apple," I'd like to return "2016-01-04 00:00 apple," the matching line closest to the end of the file. This is not difficult when there are only five lines, but performance suffers when there are millions. I've tried using tail -n [file size] to start at the end of the file, but this method does not scale well: with a line count equal to the file size, tail emits the entire file, so I can't cut the iteration short (even if the answer is the very last line, I still end up streaming through 1,500,000 lines).
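Roughly, the tail attempt looks like this (a simplified sketch; the function and parameter names are placeholders, and the total line count is passed in rather than measured):

```python
import subprocess

def last_matches_via_tail(path, pattern, count, total_lines):
    """Sketch of the tail attempt: dump the file with tail -n and
    scan the resulting lines in reverse for the newest matches."""
    output = subprocess.check_output(
        ['tail', '-n', str(total_lines), path],
        universal_newlines=True)
    matches = []
    for line in reversed(output.splitlines()):
        if pattern in line:
            matches.append(line)
            if len(matches) == count:
                break
    return matches
```

The entire file comes through the pipe before the reverse scan can even begin, which seems to be where the time goes.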
Another method I've tried is breaking the file into "chunks":
|
| Remaining lines
|
...
|
| Second group of n lines
|
|
| First group of n lines
|
I would then use GNU sed to stream only the lines in each chunk. I found, however, that the performance of the program hardly improved (and actually got worse when n was smaller).
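In code, the chunking attempt looks roughly like this (again a simplified sketch with placeholder names; the total line count is assumed to be known):

```python
import subprocess

def chunk_matches(path, pattern, count, total_lines, n=10000):
    """Sketch of the chunking attempt: walk the file in chunks of n
    lines, starting from the last chunk, and let sed -n 'START,ENDp'
    stream only the lines of the current chunk."""
    matches = []
    end = total_lines
    while end > 0 and len(matches) < count:
        start = max(end - n + 1, 1)
        output = subprocess.check_output(
            ['sed', '-n', '{0},{1}p'.format(start, end), path],
            universal_newlines=True)
        for line in reversed(output.splitlines()):
            if pattern in line:
                matches.append(line)
                if len(matches) == count:
                    break
        end = start - 1
    return matches
```

I suspect the problem is that sed still reads the file from the top to reach each range, so smaller values of n mean more passes over the same leading lines.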
Is there a better way of doing this (minimizing run-time while iterating over the file)? So far I've been calling other Linux command-line programs through subprocess, but it would be nice to use something built into Python. I'd appreciate any information that points me in the right direction.
I am using Linux with access to Python 2.7.3, 2.7.10, 2.7.11-c7, 3.3.6, and 3.5.1.