I'm trying to iterate over a very large, constantly growing file (typically around 1.5 million lines) and perform operations on each line. It's a log file, so new lines are appended at the end. My program lets users specify parameters that each line must match, and it returns the most recent matches. I'd therefore like to start at the end of the file and work backwards toward the beginning, rather than building a list of all the lines and reversing it, so the program stays efficient.
Here is an example situation:
2016-01-01 01:00 apple
2016-01-02 05:00 banana
2016-01-03 03:00 apple
2016-01-04 00:00 apple
2016-01-05 12:00 banana
If a user requested 1 line that matched "apple," I'd like to return "2016-01-04 00:00 apple," the matching line closest to the end of the file. This is not difficult when there are only five lines, but performance suffers when there are millions. I've tried using tail -n [file size] to start at the end of the file, but this method does not scale well: with a line count equal to the file size, tail emits the entire file, so I can't cut the iteration short (even if the answer is the very last line, I still end up streaming through 1,500,000 lines).
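Roughly, the tail attempt looks like this (a simplified sketch; the function and parameter names are placeholders, and the total line count is passed in rather than measured):

```python
import subprocess

def last_matches_via_tail(path, pattern, count, total_lines):
    """Sketch of the tail attempt: dump the file with tail -n and
    scan the resulting lines in reverse for the newest matches."""
    output = subprocess.check_output(
        ['tail', '-n', str(total_lines), path],
        universal_newlines=True)
    matches = []
    for line in reversed(output.splitlines()):
        if pattern in line:
            matches.append(line)
            if len(matches) == count:
                break
    return matches
```

The entire file comes through the pipe before the reverse scan can even begin, which seems to be where the time goes.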
Another method I've tried is breaking the file into "chunks":
|
| Remaining lines
|
...
|
| Second group of n lines
|
|
| First group of n lines
|
I would then use GNU sed to stream only the lines in each chunk. I found, however, that the performance of the program hardly improved (and actually got worse when n was smaller).
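In code, the chunking attempt looks roughly like this (again a simplified sketch with placeholder names; the total line count is assumed to be known):

```python
import subprocess

def chunk_matches(path, pattern, count, total_lines, n=10000):
    """Sketch of the chunking attempt: walk the file in chunks of n
    lines, starting from the last chunk, and let sed -n 'START,ENDp'
    stream only the lines of the current chunk."""
    matches = []
    end = total_lines
    while end > 0 and len(matches) < count:
        start = max(end - n + 1, 1)
        output = subprocess.check_output(
            ['sed', '-n', '{0},{1}p'.format(start, end), path],
            universal_newlines=True)
        for line in reversed(output.splitlines()):
            if pattern in line:
                matches.append(line)
                if len(matches) == count:
                    break
        end = start - 1
    return matches
```

I suspect the problem is that sed still reads the file from the top to reach each range, so smaller values of n mean more passes over the same leading lines.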
Is there a better way of doing this (minimizing run-time while iterating over the file)? So far I've been calling other Linux command-line programs through subprocess, but it would be nice to use something built into Python. I'd appreciate any information that points me in the right direction.
I am using Linux with access to Python 2.7.3, 2.7.10, 2.7.11-c7, 3.3.6, and 3.5.1.