1

Came across some different methods for reading files in Python, I was wondering which is the fastest way to do it.

For example reading the last line of a file, one can do

input_file = open('mytext.txt', 'r')
lastLine = ""
  for line in input_file:
    lastLine = line

print lastLine # This is the last line

Or

fileHandle = open('mytext.txt', 'r')
lineList = fileHandle.readlines()
print lineList[-1] #This is the last line

I'm assuming for that particular case this may be not really relevant discussing efficiency...

Question:

1. Which method is faster for picking a random line

2. Can we deal with concepts like "SEEK" in Python (if so is it faster?)

Jason Aller
  • 3,541
  • 28
  • 38
  • 38
Jaay
  • 2,103
  • 1
  • 16
  • 34
  • 7
    Yes, Python supports `.seek()` calls on file objects. Why don't you do some tests yourself? `timeit` is the module to use to properly test small code snippets and compare timings. – Martijn Pieters Jun 27 '13 at 14:38
  • 3
    Method two is very fast for files that are much larger than the available memory. It will crash long before it's read the entire file. – Fred Foo Jun 27 '13 at 14:40
  • And for *random* lines from a file you'd use a different technique altogether. See [Python random lines from subfolders](http://stackoverflow.com/q/12128948) for a discussion on how to pick random lines from files. – Martijn Pieters Jun 27 '13 at 14:41
  • @larsmans I think I am missing something: why is the amount of *physical* memory is relevant? – Elazar Jun 27 '13 at 14:41
  • @MartijnPieters I do know how to pick a random line from a file, the question is that it may be faster for the first or last line with some method, and faster for a random (let say a line in the middle of the file) with some other functions – Jaay Jun 27 '13 at 14:44
  • @Elazar: s/RAM/memory/; indeed, if you have swap on, this may be terribly *slow*. – Fred Foo Jun 27 '13 at 14:44
  • @Jaay: To pick a *uniformly random* line from a file you do have to scan through the whole file. That is a very different proposition from picking the first or last line. – Martijn Pieters Jun 27 '13 at 14:45
  • @larsmans Ok. for performance it is relevant of course. – Elazar Jun 27 '13 at 14:46
  • @MartijnPieters I'm assuming that with a `for` the file have to be scanned, but is it the case with `seek()` ? – Jaay Jun 27 '13 at 14:48
  • @Jaay: no, seek is an O(1) operation, at least in theory. But it won't seek to a specified *line*. – Fred Foo Jun 27 '13 at 14:51
  • 4
    @Jaay: The problem with `.seek()` is that unless all your lines are of uniform length, you have *no idea* where you'll end up. You'll have to scan backward or forward for a newline. – Martijn Pieters Jun 27 '13 at 14:52
  • @Jaay if you are wanting a specific "line" then `seek` is not very useful unless all lines are of some uniform length. `seek` takes a file offset not a line number. – cmd Jun 27 '13 at 14:52
  • How big is this file? If its not too big, just read the whole file and leave it in a list. Then when ever you need a random line, just `random.choice()` one. You end up paying the cost of reading once, and the memory space, but it will be vary fast to use. – cmd Jun 27 '13 at 15:10
  • If it helps any, whenever I get the last line, I use `re.search('(\n.+?$)', f.read()).group(0).strip('\n')`. – kirbyfan64sos Jun 27 '13 at 15:16
  • @cmd - seek will work for arbitrary sized lines but you have to put some work into it. Seek near the end, read and scan for new lines, seek a bit further back, and etc... `tail` does it quite handily. – tdelaney Jun 27 '13 at 15:16
  • The file is not really big in my case (like 100-200 lines), so yeah all the functions for retrieving a certain line will be fast – Jaay Jun 27 '13 at 15:17
  • @Jaay - okay,that's a small file! The expensive part isn't the scan - its the read from disk. Just open with buffering at 64K and it'll get read in one shot. – tdelaney Jun 27 '13 at 15:21
  • If someone can give a full response to close that discussion that'd be great :) – Jaay Jun 28 '13 at 08:05

2 Answers2

1

If you don't need a uniform distribution (i.e. it's okay that the chance for some line to be picked is not equal for all lines) and/or if your lines are all about the same length then the problem of picking the random line can be simplified to:

  1. Determine the size of the file in bytes
  2. Seek to a random position
  3. Search for the last newline character if any (there may be none if there's no preceding line)
  4. Pick all text up to the next newline character or the end of file, whichever comes first.

For (2) you do an educated guess for how far you've got to search backwards to find the previous newline. If you can tell that a line is n bytes on average then you could read the previous n bytes in a single step.

Frerich Raabe
  • 90,689
  • 19
  • 115
  • 207
0

I had this problematic few days ago and I use this solution. My solution is similar to the @Frerich Raabe one, but with no random, just logic :)

def get_last_line(f):
    """ f is a file object in read mode, I just extract the algorithm from a bigger function """
    tries = 0
    offs = -512

    while tries < 5:
        # Put the cursor at n*512nth character before the end.
        # If we reach the max fsize, it puts the cursor at the beginning (fsize * -1 means move the cursor of -fsize from the end)
        f.seek(max(fsize * -1, offs), 2)
        lines = f.readlines()
        if len(lines) > 1:   # If there's more than 1 lines found, then we have the last complete line
            return lines[-1]  # Returns the last complete line
        offs *= 2
        tries += 1

    raise ValueError("No end line found, after 5 tries (Your file may has only 1 line or the last line is longer than %s characters)" % offs)

The tries counters avoid to be block if the file has also one line (a very very long last line). The algorithm tries to get the last line from the last 512 characters, then 1024, 2048... and stop if there's still no complete line at the th iteration.

Maxime Lorant
  • 34,607
  • 19
  • 87
  • 97
  • 1
    The `end` variable is never initialized (nor set). Also, PEP8 spelling is preferred in Python - `get_last_line` rather than `getLastLine`. – user4815162342 Aug 26 '13 at 12:29
  • You can remove the `end` variable altogether, it's never set anywhere. `while tries < 5: ...` BTW if tries reaches 5, you might want to fall back to `f.seek(0); return f.readlines()[-1]`, or raise an exception. Silently returning `None` isn't doing the caller a favor. – user4815162342 Aug 26 '13 at 12:50
  • Indeed, I'd use the end variable in my own method, I didn't see that I remove any use of it... For the None returns, it can be discussed, since you expect a string if it works, not the `NoneType`. But well, if you insists... I return a nice ValueError message instead :-) – Maxime Lorant Aug 26 '13 at 12:59
  • In Python exceptions are *raised*, not returned. – user4815162342 Aug 26 '13 at 13:01
  • A typo again, I'm not quite awake yet at 3p.m. :-) – Maxime Lorant Aug 26 '13 at 13:05