79

I have a text file which contains a time stamp on each line. My goal is to find the time range. All the times are in order so the first line will be the earliest time and the last line will be the latest time. I only need the very first and very last line. What would be the most efficient way to get these lines in python?

Note: These files are relatively large (about 1-2 million lines each), and I have to do this for several hundred files.

mik01aj
  • 11,928
  • 15
  • 76
  • 119
pasbino
  • 801
  • 1
  • 6
  • 4

13 Answers

93
from os import SEEK_END, SEEK_CUR

def readlast(f):
    try:
        f.seek(-2, SEEK_END)       # Jump to the second last byte.
        while f.read(1) != b"\n":  #  Until newline is found ...
            f.seek(-2, SEEK_CUR)   #  ... jump back, over the read byte plus one.
    except OSError:                # Reached beginning of file
        f.seek(0)                  #  Set cursor to beginning of file as well.
    return f.read()                # Read all data from this point on.
        
with open(path, "rb") as f:
    first = f.readline()
    last  = readlast(f)

When using seek, the format is f.seek(offset, whence=0)

Quote from docs.python.org:

Change the stream position to the given byte offset. offset is interpreted relative to the position indicated by whence. The default value for whence is SEEK_SET. Values for whence are:

  • SEEK_SET or 0 = start of the stream (the default); offset should be zero or positive
  • SEEK_CUR or 1 = current stream position; offset may be negative
  • SEEK_END or 2 = end of the stream; offset is usually negative
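For reference, the three whence modes can be sketched against an in-memory buffer (io.BytesIO stands in here for a file opened in binary mode):

```python
import io
from os import SEEK_SET, SEEK_CUR, SEEK_END

buf = io.BytesIO(b"alpha\nbeta\n")  # stand-in for a file opened in "rb" mode

buf.seek(3, SEEK_SET)    # 3 bytes from the start of the stream
buf.seek(2, SEEK_CUR)    # 2 bytes forward from the current position (now at 5)
buf.seek(-5, SEEK_END)   # 5 bytes back from the end of the stream
print(buf.read())        # b'beta\n'
```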

Galloping search (2.7+)

from collections import deque
from os import SEEK_CUR, SEEK_END

def readlast(f, d = b'\n'):
    """"readlast(f: io.IOBase, d: bytes = b'\n') -> bytes

    Return the last segment of file `f`, containing data segments separated by
    `d`.
    """
    arr = deque(); step = 1; pos = -1
    try:
        # Seek to last byte of file, save it to arr so as not to check it for newline.
        pos = f.seek(-1, SEEK_END) 
        arr.appendleft(f.read())
        # Seek past the byte read, plus one to use as the first segment.
        pos = f.seek(-2, SEEK_END) 
        seg = f.read(1)
        # Break when 'd' occurs; rfind gives the index of the rightmost match.
        while seg.rfind(d) == -1:
            # Store segments with no b'\n' in a memory-efficient 'deque'.
            arr.appendleft(seg)
            # Step back in file, past the bytes just read plus twice that.
            pos = f.seek(-step*3, SEEK_CUR)
            # Read new segment, twice as big as the one read previous iteration.
            step *= 2
            seg = f.read(step)
        # Ignore the characters up to the rightmost match, including the delimiter itself.
        arr.appendleft(seg[seg.rfind(d)+1:])
    except OSError: 
        # Reached beginning of file. Read remaining data and check for newline.
        f.seek(0)
        seg = f.read(pos)
        arr.appendleft(seg[seg.rfind(d)+1:])
    return b"".join(arr)

I'd probably go for a function that makes use of an exponentially growing step size today, so I've added such an example here and will keep it alongside the original answer (for now).

It handles edge cases well, apart from multibyte delimiters and files opened in text mode (see "Edge cases" for a version that handles those).

Usage:

f.write(b'X\nY\nZ\n'); f.seek(0)
assert readlast(f) == b'Z\n'
f.write(b'\n\n'); f.seek(0)
assert readlast(f) == b'\n'

Edge cases (2.7+)

I've refrained from editing the original answer, as the question specifically asks for efficiency, and to respect previous upvotes.

This version addresses all comments and issues raised over the years while preserving the logic and backward compatibility (at the cost of readability).

The issues raised and addressed at the point of writing are:

  • Return empty string when parsing empty file, noted in comment by Loïc.
  • Return all content when no delimiter is found, raised in comment by LazyLeopard.
  • Avoid relative offsets to support text mode, raised in comment by AnotherParker.
  • UTF16/UTF32 hack, noted in comment by Pietro Battiston.

Also supports multibyte delimiters.

from os import SEEK_CUR, SEEK_END, SEEK_SET

def _readlast__bytes(f, sep, size, step):
    # Point cursor 'size' + 'step' bytes away from the end of the file.
    f.seek(0 - size - step, SEEK_END)
    # Step 'step' bytes each iteration, halt when 'sep' occurs.
    while f.read(size) != sep:
        f.seek(0 - size - step, SEEK_CUR)

def _readlast__text(f, sep, size, step):
    # Text mode, same principle but without the use of relative offsets.
    o = f.seek(0, SEEK_END)
    o = f.seek(o - size - step)
    while f.read(size) != sep:
        o = f.seek(o - step)

def readlast(f, sep, fixed = False):
    """readlast(f: io.BaseIO, sep: bytes|str, fixed: bool = False) -> bytes|str

    Return the last segment of file `f`, containing data segments separated by
    `sep`.

    Set `fixed` to True when parsing UTF-32 or UTF-16 encoded data (don't forget
    to pass the correct delimiter) in files opened in byte mode.
    """
    size = len(sep)
    step = size if fixed else 1
    if not size:
        raise ValueError("Zero-length separator.")
    try:
        if 'b' in f.mode:
            # Process file opened in byte mode.
            _readlast__bytes(f, sep, size, step)
        else:
            # Process file opened in text mode.
            _readlast__text(f, sep, size, step)
    except (OSError, ValueError): 
        # Beginning of file reached.
        f.seek(0, SEEK_SET)
    return f.read()

Usage:

f.write("X\nY\nZ\n".encode('utf32'); f.seek(0)
assert readlast(f, "\n".encode('utf32')[4:]) == "Z\n"
f.write(b'X<br>Y</br>'; f.seek(0)
assert readlast(f, b'<br>', fixed=False) == "Y</br>"

Efficiency

Code used to compare against this answer (optimised version of the most upvoted answer [at the point of posting]):

with open(file, "rb") as f:
    first = f.readline()     # Read and store the first line.
    for last in f: pass      # Read all lines, keep final value.

Results:

10k iterations processing a file of 6k lines totalling 200kB: 1.62s vs  6.92s
100 iterations processing a file of 6k lines totalling 1.3GB: 8.93s vs 86.95s

"1-2 millions lines each", as the question stated, would of course increase the difference a lot more.

Trasp
  • 1,132
  • 7
  • 11
  • 4
    This is the most concise solution, and I like it. The nice part about not guessing a blocksize is that it works well with small test files. I added a few lines and wrapped it in a function I fondly call `tail_n`. – MarkHu Jun 27 '14 at 03:56
  • 1
    I love it on the paper but can't have it to work. `File "mapper1.2.2.py", line 17, in get_last_line f.seek(-2, 2) IOError: [Errno 22] Invalid argument` – Loïc Sep 29 '14 at 14:12
  • 2
    As per [this comment](http://stackoverflow.com/a/31460631) as an answer, this `while f.read(1) != "\n":` should be `while f.read(1) != b"\n":` – Artjom B. Jul 17 '15 at 16:49
  • That is true for python 3, and it doesn't matter for python 2, so I've updated the answer accordingly. Thank you. – Trasp Dec 14 '15 at 21:29
  • 1
    For the records: if the file is UTF-16 encoded, replace ``f.seek(-2, 2)`` with ``f.seek(-4, 2)``, ``f.seek(-2, 1)`` with ``f.seek(-3, 1)`` and each ``f.readline()`` with ``f.readline() + '\x00'`` – Pietro Battiston Jan 14 '16 at 15:39
  • 5
    Also for the record: If you get the exception `io.UnsupportedOperation: can't do nonzero end-relative seeks`, you have to do it in two steps: first find the length of the file, then add the offset, then pass that to `f.seek(size+offset,os.SEEK_SET)` – AnotherParker Feb 08 '16 at 22:14
  • 1
    Such a simple solution! However if there is no b"\n" in the file. an IOError is thrown when a start of file is reached while seeking backwards. – LazyLeopard May 09 '18 at 11:34
  • why `f.seek(-2, os.SEEK_END)`? Do you assume there is always a '\n' at the end of the file and you want to avoid that? – Ruggero Turra Jul 12 '18 at 14:06
  • The above code returns error if there is only one line in the file and there is no \n at the end of the first line `with open(filename, "r") as f: first = f.readline() if f.read(1) == '': return first f.seek(-2, 2) # Jump to the second last byte. while f.read(1) != b"\n": # Until EOL is found... f.seek(-2, 1) # ...jump back the read byte plus one more. last = f.readline() # Read last line. return last` Modified the above code to return first line if there is only one line in file – user37940 Jul 29 '18 at 08:48
  • 1
    @PietroBattiston Great comment! But, if you still read one byte to compare against `b'\n'` like that it would break lines on any two byte UTF16 character that begins with `b'\n'`, like ✊. – Trasp Apr 08 '20 at 08:59
  • @RuggeroTurra No, I wouldn't say that. No extra line or character is read if the last byte of the file(s last line) is not a newline character (_\n_). – Trasp Apr 09 '20 at 11:26
  • For the record though, assuming good and common practice is to terminate lines written to files/buffers (with `b'\n'`) I also assumed that the expected behavior when a file/line ends with `b'\n'` would be to read the full line of text including the terminating `b'\n'`. For example, the expected result of a file created by running `{ echo "x"; echo "y";} > f` would most likely be `b'x', b'y'`, not `b'x', b''`. – Trasp Apr 09 '20 at 11:26
67

docs for io module

with open(fname, 'rb') as fh:
    first = next(fh).decode()

    fh.seek(-1024, 2)
    last = fh.readlines()[-1].decode()

The variable value here is 1024: it represents the average string length. I chose 1024 only as an example. If you have an estimate of the average line length, you could just use that value times two.

Since you have no idea whatsoever about the possible upper bound for the line length, the obvious solution would be to loop over the file:

for line in fh:
    pass
last = line

You don't need to bother with the binary flag; you could just use open(fname).

ETA: Since you have many files to work on, you could create a sample of a couple dozen files using random.sample and run this code on them to determine the length of the last line, with an a priori large value for the position shift (say, 1 MB). This will help you estimate the value for the full run.
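That sampling idea could be sketched like this (estimate_offset, its parameter values, and the safety factors are hypothetical choices, not from the answer):

```python
import os
import random

def estimate_offset(paths, sample_size=24, probe=1024 * 1024):
    """Probe the tail of a few randomly chosen files to pick a safe seek offset."""
    longest = 0
    for path in random.sample(paths, min(sample_size, len(paths))):
        size = os.path.getsize(path)
        with open(path, "rb") as fh:
            fh.seek(max(size - probe, 0))     # read at most the last 1 MB
            lines = fh.read().split(b"\n")
            longest = max(longest, max(len(line) for line in lines))
    return max(longest * 2, 128)              # double it as a safety margin
```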

John Y
  • 14,123
  • 2
  • 48
  • 72
SilentGhost
  • 307,395
  • 66
  • 306
  • 293
  • As long as the lines aren't longer than 1024 characters. – FogleBird Jul 27 '10 at 18:08
  • There is no guarantee that the lines aren't longer than 1024 characters, there may be some other junk besides the timestamps on the line. – pasbino Jul 27 '10 at 18:10
  • @pasbino: do you have *some* upper bound? – SilentGhost Jul 27 '10 at 18:11
  • @pasbino: You can still use a similar approach in a loop until you find a full line. – FogleBird Jul 27 '10 at 18:12
  • Unfortunately, I have not seen every line of these files. From a quick glance some of these lines seem extremely long. I don't think I can estimate an upper bound safely. – pasbino Jul 27 '10 at 18:14
  • @pasbino: 1 MB? there is always possibility to check for EOL character in the cut off chunk and cut more. – SilentGhost Jul 27 '10 at 18:16
  • Hmmmmm thats what I feared. Currently I loop over the whole file and it takes a while. But I guess without a line length upper bound there is no faster way. – pasbino Jul 27 '10 at 18:18
  • The files are about 150 MB's in size – pasbino Jul 27 '10 at 18:20
  • @pasbino: 1 MB was an example of the length of last string. – SilentGhost Jul 27 '10 at 18:21
  • Uhmmm so depending on the file its different. I just checked on a few and it seems like they're only a few kilobytes – pasbino Jul 27 '10 at 18:26
  • Sorry for so many questions but how does the seek work in your first example? It means set the file's current position to 1024 bytes from the end of the file? – pasbino Jul 27 '10 at 18:37
  • @pasbino: yes. [docs](http://docs.python.org/library/io.html?highlight=seek#io.IOBase.seek) has more information. – SilentGhost Jul 27 '10 at 18:43
  • 20
    Using `fh.seek(-1024, os.SEEK_END)` instead of `fh.seek(-1024, 2)` increases readability. – marsl May 12 '14 at 16:31
  • 3
    The following is not true: *You don't need to bother with the binary flag you could just use `open(fname)`.* Opening with `b` flag is crucial. If you use `open(fname)` instead of `open(fname, 'rb')` you [will get](https://stackoverflow.com/questions/21533391/seeking-from-end-of-file-throwing-unsupported-exception) *io.UnsupportedOperation: can't do nonzero end-relative seeks*. – patryk.beza Jul 13 '17 at 07:08
25

Here's a modified version of SilentGhost's answer that will do what you want.

with open(fname, 'rb') as fh:
    first = next(fh)
    offs = -100
    while True:
        fh.seek(offs, 2)
        lines = fh.readlines()
        if len(lines)>1:
            last = lines[-1]
            break
        offs *= 2
    print(first)
    print(last)

No need for an upper bound for line length here.

mik01aj
  • 11,928
  • 15
  • 76
  • 119
10

Can you use unix commands? I think using head -1 and tail -n 1 is probably the most efficient method. Alternatively, you could use a simple fid.readline() to get the first line and fid.readlines()[-1] for the last, but that may take too much memory.
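If shelling out is acceptable, those tools can also be driven from Python; a POSIX-only sketch (assumes head and tail are on the PATH):

```python
import subprocess

def first_and_last(path):
    # head/tail do the seeking for us; each returns one line as bytes.
    first = subprocess.check_output(["head", "-1", path])
    last = subprocess.check_output(["tail", "-n", "1", path])
    return first, last
```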

beitar
  • 156
  • 4
6

This is my solution, compatible with Python 3 as well. It also handles edge cases, but it lacks UTF-16 support:

import os

def tail(filepath):
    """
    @author Marco Sulla (marcosullaroma@gmail.com)
    @date May 31, 2016
    """

    try:
        filepath.is_file
        fp = str(filepath)
    except AttributeError:
        fp = filepath

    with open(fp, "rb") as f:
        size = os.stat(fp).st_size
        start_pos = 0 if size - 1 < 0 else size - 1

        if start_pos != 0:
            f.seek(start_pos)
            char = f.read(1)

            if char == b"\n":
                start_pos -= 1
                f.seek(start_pos)

            if start_pos == 0:
                f.seek(start_pos)
            else:
                char = ""

                for pos in range(start_pos, -1, -1):
                    f.seek(pos)

                    char = f.read(1)

                    if char == b"\n":
                        break

        return f.readline()

It's inspired by Trasp's answer and AnotherParker's comment.

Community
  • 1
  • 1
Marco Sulla
  • 15,299
  • 14
  • 65
  • 100
4

First, open the file in read mode. Then use the readlines() method to read it line by line; all the lines are stored in a list. Now you can use list indexing to get the first and last lines of the file.

    a = open('file.txt', 'rb')
    lines = a.readlines()
    if lines:
        first_line = lines[0]
        last_line = lines[-1]
    a.close()
4
w = open('file.txt', 'r')
print ('first line is : ',w.readline())
for line in w:  
    x= line
print ('last line is : ',x)
w.close()

The for loop runs through the lines and x gets the last line on the final iteration.

Mr. Llama
  • 20,202
  • 2
  • 62
  • 115
VipeR
  • 49
  • 1
  • This should be the accepted answer. I don't know why there's all this messing around with low level io in the other answers? – GreenAsJade Jun 21 '16 at 14:19
  • 3
    @GreenAsJade My understanding is that the "messing around" is to avoid reading the whole file from start to end. This might be inefficient on a large file. – bli Sep 14 '17 at 14:40
3
with open("myfile.txt") as f:
    lines = f.readlines()
    first_row = lines[0]
    print first_row
    last_row = lines[-1]
    print last_row
Riccardo Volpe
  • 1,471
  • 1
  • 16
  • 30
  • Can you explain why your solution will be better ? – Zulu Jan 31 '15 at 02:26
  • Hi, I found myself in the same need, to remove the last comma at level of the last line in a text file, and in this way I solved to locate it easily; I thought then to share it. This solution has been simple, practical and immediate, but I don't know if it is the fastest in terms of efficiency. What can you tell me about it? – Riccardo Volpe Feb 03 '15 at 02:26
  • Well, it has to read and process the entire file so it seems like the least efficient way. – rakslice May 01 '15 at 23:24
  • Ok...so, if you don't know the string length, which would be the best one method? I need to try the other one (http://stackoverflow.com/a/3346492/2149425). Thank you! – Riccardo Volpe May 04 '15 at 18:43
  • 1
    use `f.readlines()[-1]` insead of new variable. **0** = *First Line*, **1** = *Second Line*, **-1** = *Last Line*, **-2** = *Line Before Last Line*... – BladeMight Aug 05 '16 at 02:17
2

Here is an extension of @Trasp's answer that has additional logic for handling the corner case of a file that has only one line. It may be useful to handle this case if you repeatedly want to read the last line of a file that is continuously being updated. Without this, if you try to grab the last line of a file that has just been created and has only one line, IOError: [Errno 22] Invalid argument will be raised.

def tail(filepath):
    with open(filepath, "rb") as f:
        first = f.readline()      # Read the first line.
        f.seek(-2, 2)             # Jump to the second last byte.
        while f.read(1) != b"\n": # Until EOL is found...
            try:
                f.seek(-2, 1)     # ...jump back the read byte plus one more.
            except IOError:
                f.seek(-1, 1)
                if f.tell() == 0:
                    break
        last = f.readline()       # Read last line.
    return last
tony_tiger
  • 789
  • 1
  • 11
  • 25
2

Nobody mentioned using reversed:

f = open(file, "r")
r = reversed(f.readlines())
last_line_of_file = next(r)
1

Getting the first line is trivially easy. For the last line, presuming you know an approximate upper bound on the line length, seek some amount back from SEEK_END, find the second-to-last line ending, and then readline() the last line.
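A sketch of that approach (MAX_LINE is a hypothetical upper bound on line length, which the asker could not actually assume, and read_ends is an illustrative name):

```python
import os

MAX_LINE = 4096  # hypothetical upper bound on line length, in bytes

def read_ends(path):
    with open(path, "rb") as f:
        first = f.readline()
        # Jump to at most MAX_LINE bytes before the end, then keep the
        # final line of whatever remains.
        f.seek(max(os.path.getsize(path) - MAX_LINE, 0))
        last = f.readlines()[-1]
    return first, last
```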

msw
  • 42,753
  • 9
  • 87
  • 112
1
def get_last_line(filename):
    # Needs to be in binary mode for the seek from the end to work.
    with open(filename, "rb") as f:
        first = f.readline()
        if f.read(1) == b'':           # Only one line in the file.
            return first
        f.seek(-2, 2)                  # Jump to the second last byte.
        while f.read(1) != b"\n":      # Until EOL is found...
            f.seek(-2, 1)              # ...jump back the read byte plus one more.
        return f.readline()            # Read last line.

The code above is a modified version of the earlier answers that also handles the case where there is only one line in the file.

Okapi575
  • 710
  • 1
  • 10
  • 24
user37940
  • 478
  • 1
  • 4
  • 17
0

If you're only looking for a convenient small snippet and it's suitable to read the whole file, consider deque.

from collections import deque

with open("/path/to/file", "rb+") as f:
    first = f.readline()
    try:
        last = deque(f, 1)[0]
    except IndexError:
        last = ""
        

Passing the file object f to deque causes the built-in line iteration in the io library to split the stream into individual lines, while deque keeps only the last line in memory.

Trasp
  • 1,132
  • 7
  • 11