
Possible Duplicate:
Get last n lines of a file with Python, similar to tail
Read a file in reverse order using python

I have a log file that's about 15GB in size, and I'm supposed to analyze its output. I already did a basic parse of a similar but greatly smaller file, with just a few lines of logging. Parsing the strings is not the issue. The issue is the huge file and the amount of redundant data it contains.

Basically I'm attempting to make a Python script that I can tell, for example, to give me the last 5000 lines of the file. Again, handling the arguments and all that is basic, nothing special there, I can do that.

But how do I tell the file reader to read ONLY the specified number of lines from the end of the file? I'm trying to skip the huge number of lines at the beginning of the file, since I'm not interested in those, and to be honest, reading about 15GB of lines from a text file takes too long. Is there a way to, err... start reading from the end of the file? Does that even make sense?

It all boils down to this: reading a 15GB file line by line takes too long. So I want to skip the data at the beginning that is redundant (to me, at least) and read only the number of lines I want from the end of the file.

The obvious answer is to manually copy N lines from the file to another file, but is there a way to do this semi-automagically, just reading the last N lines of the file with Python?

codeforester
Mike
  • Not a direct answer, but if you're using *nix you could accept input from stdin instead and just send the data using `tail hugefile.txt -n1000 | python myprog.py` – moopet Sep 06 '12 at 06:36
    See the answers on the duplicate question. The first is relatively platform-independent, the second works well on UNIX-based systems (using the `tail` command like @moopet suggested). – David Robinson Sep 06 '12 at 06:37
    Also, look [here](http://code.activestate.com/recipes/276149/) and [here](http://code.activestate.com/recipes/120686/) – sloth Sep 06 '12 at 06:38
    @BigYellowCactus: The "Read a file in reverse" question doesn't specify a large file, which means most of the answers would never work for one (the accepted answer uses `readlines`!!). – David Robinson Sep 06 '12 at 06:41
  • Thank you all for great answers, it was a duplicate + got some great answers. The keywords I was searching with were completely off, I never even thought of using tail or any other magic like that. Thanks all for answers. Will mark it answered and give points to all that deserve it. Thanks! – Mike Sep 06 '12 at 06:44
  • @DavidRobinson I don't see how 2 out of 5 (the answer with -2 score aside) are *most of the answers*. Also, one of the answers with a `readlines` solution links to two ActiveState recipes which handle big files. – sloth Sep 06 '12 at 06:46
  • But yes, the huge file definitely is a big problem. But at least with tail I can easily manage everything through a pipe. – Mike Sep 06 '12 at 06:46
  • @BigYellowCactus: You're right, though I like the other question. Mike: Don't worry about the answers, it'll be closed as a duplicate before long (but will still be useful as a signpost to the other questions, just as it was for you). – David Robinson Sep 06 '12 at 06:49

4 Answers


Farm this out to unix:

import os
os.popen('tail -n 1000 filepath').read()

Use subprocess.Popen instead of os.popen if you need to be able to access stderr (and some other features).
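To illustrate the subprocess route mentioned above, here is a minimal sketch (the function name and error handling are my own, not from the answer; it assumes a Unix-like system with `tail` on the PATH):

```python
import subprocess

def tail_lines(path, n=1000):
    # Run the external tail; unlike os.popen, we can inspect stderr
    # and the exit code to tell a missing file from an empty one.
    proc = subprocess.Popen(
        ['tail', '-n', str(n), path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    out, err = proc.communicate()
    if proc.returncode != 0:
        raise OSError(err.decode(errors='replace').strip())
    return out.decode(errors='replace').splitlines()
```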

user1479095

You need to seek to the end of the file, then read some chunks in blocks from the end, counting lines, until you've found enough newlines to read your n lines.

Basically, you are re-implementing a simple form of tail.

Here's some lightly tested code that does just that:

import os, errno

def lastlines(hugefile, n, bsize=2048):
    # get newlines type, open in universal mode to find it
    with open(hugefile, 'rU') as hfile:
        if not hfile.readline():
            return  # empty, no point
        sep = hfile.newlines  # After reading a line, python gives us this
    assert isinstance(sep, str), 'multiple newline types found, aborting'

    # find a suitable seek position in binary mode
    with open(hugefile, 'rb') as hfile:
        hfile.seek(0, os.SEEK_END)
        linecount = 0
        pos = 0

        while linecount <= n + 1:
            # read at least n lines + 1 more; we need to skip a partial line later on
            try:
                hfile.seek(-bsize, os.SEEK_CUR)           # go backwards
                linecount += hfile.read(bsize).count(sep) # count newlines
                hfile.seek(-bsize, os.SEEK_CUR)           # go back again
            except IOError, e:
                if e.errno == errno.EINVAL:
                    # Attempted to seek past the start, can't go further
                    bsize = hfile.tell()
                    hfile.seek(0, os.SEEK_SET)
                    pos = 0
                    linecount += hfile.read(bsize).count(sep)
                    break
                raise  # Some other I/O exception, re-raise
            pos = hfile.tell()

    # Re-open in text mode
    with open(hugefile, 'r') as hfile:
        hfile.seek(pos, os.SEEK_SET)  # our file position from above

        for line in hfile:
            # We've located n lines *or more*, so skip if needed
            if linecount > n:
                linecount -= 1
                continue
            # The rest we yield
            yield line
Martijn Pieters
  • How do you print the yielded lines? – Superdooperhero Aug 18 '17 at 20:25
  • Gives me:Traceback (most recent call last): File "tail3.py", line 45, in lastlines(r"E:\D_Backup\Downloads\googlebooks-eng-all-2gram-20120701-_NOUN_", 1000, bsize=2048) File "tail3.py", line 21, in lastlines linecount += hfile.read(bsize).count(sep) # count newlines TypeError: a bytes-like object is required, not 'str' – Superdooperhero Aug 18 '17 at 20:30
    @Superdooperhero: the code was written for Python 2, not Python 3. You'd have to use `sep.encode()` to get a `bytes` object instead. – Martijn Pieters Aug 18 '17 at 20:54
  • This is a brilliant answer! Loved how you are counting newlines while reading chunks. Makes this an efficient solution! – Tushar Vazirani Aug 25 '18 at 21:34
  • I think I found a bug though. When you are breaking from the loop in the case of seeking past the start, you may want to add pos=0 or pos=hfile.tell() to yield all the lines of the file, since the number of requested lines exceeds the number of lines in the file. – Tushar Vazirani Aug 25 '18 at 21:46
  • @TusharVazirani: well spotted; I added `pos = 0` will fix that. – Martijn Pieters Aug 27 '18 at 13:59
  • this fixed my problem regarding counting lines: `linecount += hfile.read(bsize).count(ord(sep))` – Behnam Shomali Apr 23 '19 at 13:12
    @BehnamYousefi: this answer was written for Python 2; if you need to use `ord(sep)` then you are using Python 3 instead. You could just use `sep.encode()` instead. – Martijn Pieters Apr 23 '19 at 13:15
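Since the comments note that the answer above targets Python 2, here is a rough Python 3 sketch of the same idea (chunked backwards reads from the end, counting newlines); it is an adaptation, not Martijn's code, and it assumes `\n` line endings and UTF-8 text:

```python
import os

def last_lines(path, n, bsize=2048):
    # Walk backwards from EOF in bsize chunks until the buffer holds
    # more than n newlines, then keep only the last n lines.
    with open(path, 'rb') as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        buf = b''
        while pos > 0 and buf.count(b'\n') <= n:
            step = min(bsize, pos)   # never seek before the start
            pos -= step
            f.seek(pos, os.SEEK_SET)
            buf = f.read(step) + buf
    return [line.decode('utf-8') for line in buf.splitlines()[-n:]]
```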

Even though I would prefer the 'tail' solution: if you know the maximum number of characters per line, you can implement another possible solution by getting the size of the file, opening a file handle, and using the 'seek' method with an estimate of the number of characters you are looking for.

The final code should look something like this; it may also explain why I prefer the tail solution :) Good luck!

import os

MAX_CHARS_PER_LINE = 80
number_of_requested_lines = 5000  # e.g. taken from the command line

size_of_file = os.path.getsize('15gbfile.txt')
file_handler = open('15gbfile.txt', 'rb')
# Over-estimate how far back to seek; never go past the start
seek_index = max(size_of_file - (number_of_requested_lines * MAX_CHARS_PER_LINE), 0)
file_handler.seek(seek_index)
buffer = file_handler.read()
file_handler.close()

You can improve this code by analyzing the newlines in the buffer you read. Good luck (and you should use the tail solution ;-) I am quite sure you can get tail for every OS).
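One way to do that newline analysis (my sketch, not from the answer): since the estimated seek point usually lands mid-line and over-reads, split the buffer and keep exactly the last N lines, discarding the possibly partial first one.

```python
def trim_to_last_n_lines(buffer, n):
    # splitlines() drops the trailing newline; slicing from the end
    # discards the (possibly partial) leading line and any excess.
    return buffer.splitlines()[-n:]
```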


The preferred method in the end was just to use unix's tail for the job and modify the Python to accept input through standard input:

tail -n 1000 hugefile.txt | python magic.py

It's nothing sexy, but at least it gets the job done. The big file is too much of a burden to handle, I found out, at least for my Python skills, so it was a lot easier to add a pinch of *nix magic to cut down the file size. tail was new to me, so I learned something and figured out another way of using the terminal to my advantage. Thank you everyone.
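The asker didn't share magic.py, but the stdin-reading side of the pipe might look something like this (the `process` function is a hypothetical placeholder for the actual log parsing):

```python
import sys

def process(lines):
    # Placeholder for the real log analysis: here we just strip
    # the trailing newline from each tail-ed line.
    return [line.rstrip('\n') for line in lines]

if __name__ == '__main__':
    # Lines arrive on stdin via: tail -n 1000 hugefile.txt | python magic.py
    for parsed in process(sys.stdin):
        print(parsed)
```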

David Robinson
Mike