
I have an ASCII table in a file from which I want to read a particular set of lines (e.g. lines 4003 to 4005). The issue is that this file could be very, very long (e.g. hundreds of thousands to millions of lines), and I'd like to do this as quickly as possible.

a) Bad Solution: Read in the entire file, and go to those lines,

f = open('filename')
lines = f.readlines()[4003:4005]

b) Better Solution: enumerate over each line so that it's not all in memory (a la https://stackoverflow.com/a/2081880/230468)

f = open('filename')
lines = []
for i, line in enumerate(f):
    if 4003 <= i <= 4005: lines.append(line)
    if i > 4005: break                                    # @Wooble

c) Best Solution?

But b) still requires going through each line.

Is there a better (in terms of speed/efficiency) method of accessing a particular line from a huge file?

  • Should I use a linecache even though I will only access the file once (typically)?
  • Using a binary file instead, which would make it easier to skip ahead, is an option, but I'd much rather avoid it.
DilithiumMatrix

3 Answers


I would probably just use itertools.islice. Using islice over an iterable like a file handle means the whole file is never read into memory, and the first 4003 lines are discarded as quickly as possible. You could even put the two lines you need into a list pretty cheaply (assuming the lines themselves aren't very long). Then exiting the with block closes the file handle.

from itertools import islice
with open('afile') as f:
    lines = list(islice(f, 4003, 4005))
do_something_with(lines)

Update

But holy cow is linecache faster for multiple accesses. I created a million-line file to compare islice and linecache and linecache blew it away.

>>> timeit("x=islice(open('afile'), 4003, 4005); print next(x) + next(x)", 'from itertools import islice', number=1)
4003
4004

0.00028586387634277344
>>> timeit("print getline('afile', 4003) + getline('afile', 4004)", 'from linecache import getline', number=1)
4002
4003

2.193450927734375e-05

>>> timeit("getline('afile', 4003) + getline('afile', 4004)", 'from linecache import getline', number=10**5)
0.14125394821166992
>>> timeit("''.join(islice(open('afile'), 4003, 4005))", 'from itertools import islice', number=10**5)
14.732316970825195

Constantly re-importing and re-reading the file:

This is not a practical test, but even re-importing linecache at each step, it's only about a second slower than islice.

>>> timeit("from linecache import getline; getline('afile', 4003) + getline('afile', 4004)", number=10**5)
15.613967180252075

Conclusion

Yes, linecache is faster than islice for everything but constantly re-creating the linecache, but who does that? For the likely scenarios (reading only a few lines once, or reading many lines once), linecache is faster and has terser syntax, but the islice syntax is quite clean and fast as well, and it never reads the whole file into memory. In a RAM-tight environment, the islice solution may be the right choice. For very high speed requirements, linecache may be the better choice. Practically, though, in most environments both times are small enough that it almost doesn't matter.

kojiro
  • Thanks @kojiro. Why is this better/worse/different than `linecache` or `enumerate`? – DilithiumMatrix Oct 04 '13 at 20:35
  • @FredrikPihl linecache is optimized for reading many lines from a file. OP is reading 2. – kojiro Oct 04 '13 at 20:38
  • …but I can't argue with the results. – kojiro Oct 04 '13 at 20:49
  • linecache reads in the whole file and caches it internally. This can be pretty fast, particularly if the file is stored as one big block on the HDD. Even a 10 MB file can be read in less than 100 ms. This gets even faster if many random accesses are needed. – Scheintod Oct 04 '13 at 21:02
  • I believe the timings are misleading because `getline` caches the whole file, so all but the first run will be very fast. Just anecdotal evidence (because I didn't profile, etc), but I used your `islice` solution a couple of days ago. – jorgeca Oct 04 '13 at 21:07
  • @jorgeca Yes, but on the other hand, I ran it only once the first time and it's still quite a bit faster than `islice`. – kojiro Oct 04 '13 at 21:10
  • @kojiro On a fully clean session? Just have a look at all the things `getline` does [before returning the line](http://hg.python.org/cpython/file/2.7/Lib/linecache.py) (which includes reading the whole file). – jorgeca Oct 04 '13 at 21:13
  • @jorgeca edited to include constantly re-creating the linecache. – kojiro Oct 04 '13 at 21:20
  • @kojiro if your million-line file only contains the numbers from 1 to a million, it's not exactly big. For a complete benchmark, you'd probably have to try the same thing with a million lines of varying sizes from 10 to 1k bytes each, and then compare access times for lines at the beginning, middle and end of the file. I'd guess that in the case of really big files, `islice` etc. would be faster for the beginning portion of the file, but would be outperformed by `linecache` after a certain line (assuming `linecache` doesn't just crash because it runs out of memory...) – l4mpi Oct 04 '13 at 21:32
  • linecache can be used in combination with splitting the file into smaller chunks of 1-10k lines each, which can be accessed directly (see my edit). But care must be taken to clear the cache manually, or it will eventually contain all the files. – Scheintod Oct 04 '13 at 21:37
  • @l4mpi I don't buy the argument that a larger file implies a more complete benchmark. It just makes for a different benchmark. *Complete* would involve a graph of a lot of benchmarks of different-sized files. OP didn't actually say his file was *large*, he just said it was *long*. So I created a million-line long file. – kojiro Oct 05 '13 at 03:19
  • @kojiro that's what I said; testing with varying line lengths and thus file sizes, as your benchmark doesn't test cases that are more limited by I/O... – l4mpi Oct 05 '13 at 10:03

The main problem here is that line breaks are in no way different from any other character, so the OS has no way of skipping ahead to a given line.

That said, there are a few options, but for each one you have to make sacrifices in one way or another.

You already stated the first one: use a binary file. If you have a fixed line length, then you can seek ahead line * bytes_per_line bytes and jump directly to that line.
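
A minimal sketch of the fixed-width idea (the record length and function name here are illustrative assumptions, not part of the answer):

BYTES_PER_LINE = 80                       # assumed fixed record length, newline included

def read_fixed_line(path, lineno):
    # jump straight to the record; this only works if every line has the same length
    with open(path, 'rb') as f:
        f.seek(lineno * BYTES_PER_LINE)
        return f.read(BYTES_PER_LINE)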

The next option would be using an index: create a second file, and in every line of this index file write the byte offset of the corresponding line in your datafile. Accessing the datafile now involves two seek operations (skip to the line of the index, then skip to index_value in the datafile), but it will still be pretty fast. Plus: it saves disk space because the lines can have different lengths. Minus: you can't touch the datafile with an editor.
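
A rough sketch of the index-file idea (the 16-byte index record and the helper names are assumptions for illustration):

INDEX_RECORD = 16                          # assumed fixed width per index entry, newline included

def build_index(datafile, indexfile):
    # write one fixed-width byte offset per line of the datafile
    with open(datafile, 'rb') as data, open(indexfile, 'wb') as idx:
        offset = 0
        for line in data:
            idx.write(str(offset).rjust(INDEX_RECORD - 1).encode() + b'\n')
            offset += len(line)

def get_indexed_line(datafile, indexfile, lineno):
    # two seeks: one into the index, one into the datafile
    with open(indexfile, 'rb') as idx:
        idx.seek(lineno * INDEX_RECORD)
        offset = int(idx.read(INDEX_RECORD))
    with open(datafile, 'rb') as data:
        data.seek(offset)
        return data.readline()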

One more option (I think I would go with this one) is to use only one file, but begin every line with the line number and some kind of separator (e.g. 4005: My data line). Now you can use a modified version of binary search (https://en.wikipedia.org/wiki/Binary_search_algorithm) to seek to your line. This will take around log(n) seek operations, with n being the total number of lines. Plus: you can edit the file, and it saves space compared to fixed-length lines. And it's still very fast: even for one million lines this is only about 20 seek operations, which happen in no time. Minus: the most complex of these possibilities. (But fun to do ;)
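
A sketch of that binary search (the function name, the separator handling, and the 64-byte cut-over to a linear scan are my own illustrative choices; it assumes the line-number prefixes are strictly increasing):

import os

def get_numbered_line(path, target, sep=b':'):
    size = os.path.getsize(path)
    with open(path, 'rb') as f:
        lo, hi = 0, size                    # the target line starts somewhere in [lo, hi)
        while hi - lo > 64:
            mid = (lo + hi) // 2
            f.seek(mid)
            f.readline()                    # discard the partial line we landed in
            line = f.readline()             # first complete line after mid
            if not line:
                hi = mid + 1                # nothing after mid: target is at or before it
                continue
            num = int(line.split(sep, 1)[0])
            if num == target:
                return line
            if num < target:
                lo = mid                    # target starts strictly after mid
            else:
                hi = mid + 1                # target starts at or before mid
        # finish off the remaining small window with a linear scan
        f.seek(lo)
        if lo:
            f.readline()                    # lo may point into the middle of a line
        for line in iter(f.readline, b''):
            num = int(line.split(sep, 1)[0])
            if num == target:
                return line
            if num > target:
                break
    return None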

EDIT: One more solution: split your file into many smaller ones. If you have very long 'lines', this could be as small as one line per file, but then I would put them in groups in folders, e.g. 4/0/05. Even with shorter lines, divide your file into, say, roughly 1 MB chunks, name them 1000.txt, 2000.txt, and so on, and reading the one (or two) that match your line completely should be pretty fast and very easy to implement.
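
A short sketch of the chunking idea, assuming each chunk holds a fixed number of lines and is named after the first line number it contains (both assumptions for illustration; adjust the naming to taste):

import os
from itertools import islice

LINES_PER_CHUNK = 1000                 # assumed chunk size

def split_into_chunks(path, dirname):
    # write lines 0-999 to 0.txt, lines 1000-1999 to 1000.txt, and so on
    os.makedirs(dirname, exist_ok=True)
    out = None
    with open(path) as src:
        for i, line in enumerate(src):
            if i % LINES_PER_CHUNK == 0:
                if out:
                    out.close()
                out = open(os.path.join(dirname, '%d.txt' % i), 'w')
            out.write(line)
    if out:
        out.close()

def get_chunked_line(dirname, lineno):
    first = (lineno // LINES_PER_CHUNK) * LINES_PER_CHUNK
    with open(os.path.join(dirname, '%d.txt' % first)) as f:
        return next(islice(f, lineno - first, None))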

Scheintod
  • I think the index file solution is the best idea here :) – Maxime Lorant Oct 04 '13 at 20:35
  • I also like the index solution, but obviously it would only help for lines beyond some critical line number (i.e. it's slower to use the index if the target line number is the first line). It would be very interesting to compare between these methods. – DilithiumMatrix Oct 04 '13 at 20:38
  • Thanks @Scheintod, the last option does sound pretty awesome! – DilithiumMatrix Oct 04 '13 at 20:41
  • One problem however: the contents of each line will be lots of numbers, in particular long series of decimals... Would searching for `4005:` be enough to prevent accidentally going to 4.9340052? – DilithiumMatrix Oct 04 '13 at 20:43
  • @zhermes: yes. For small files the quickest solution is to read in the whole file. I would guess the other methods could pay off somewhere between 10k and 1M lines. But this still depends on line length, of course. And, of course, heavily on the type of hard disk used (HDD/SSD). – Scheintod Oct 04 '13 at 20:46
  • @zhermes: if you don't have a `:` character in your file, there should be no problem. You can use other delimiters, or add a `|` character before 4005, too. – Scheintod Oct 04 '13 at 20:49
  • The good part of the third method is that it scales to files the size of the complete hard disk without getting noticeably slower. So there is room :) – Scheintod Oct 04 '13 at 20:52

I ran into a similar problem to the one above; however, the solutions posted above had problems in my particular scenario: the file was too big for linecache, and islice was nowhere near fast enough. I would like to offer a third (or fourth) alternative solution.

My solution is based on the fact that we can use mmap to access a particular point in the file. We need only know where lines begin and end in the file; then the mmap can give them to us about as fast as linecache. To optimize this code (see the updates):

  • We use the deque class from collections to create a dynamically sized collection of line endpoints.
  • We then convert it to a list, which optimizes random access into that collection.

The following is a simple wrapper for the process:

from collections import deque
import mmap

class fast_file():

    def __init__(self, file):
        self.file = file
        self.linepoints = deque()
        self.linepoints.append(0)
        pos = 0
        # scan in binary mode so byte offsets match the mmap below
        with open(file, 'rb') as fp:
            while True:
                c = fp.read(1)
                if not c:
                    break
                pos += 1
                if c == b'\n':
                    self.linepoints.append(pos)   # next line starts just after the newline
        self.fp = open(self.file, 'r+b')
        self.mm = mmap.mmap(self.fp.fileno(), 0)
        self.linepoints.append(pos)               # sentinel for the end of the last line
        self.linepoints = list(self.linepoints)   # a list is faster for random access

    def getline(self, i):
        # i is 0-based; the result is bytes and includes the trailing newline
        return self.mm[self.linepoints[i]:self.linepoints[i + 1]]

    def close(self):
        self.mm.close()
        self.fp.close()

The caveat is that the file and mmap need closing, and enumerating the line endpoints can take some time. But it is a one-off cost. The result is something that is fast both to instantiate and in random file access; note, however, that the output is of type bytes.
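
Since the wrapper needs that explicit close(), one convenient usage pattern (just a suggestion, assuming the fast_file class above and a hypothetical sample.txt) is contextlib.closing:

from contextlib import closing

# closing() calls F.close() automatically when the with block exits
with closing(fast_file('sample.txt')) as F:
    print(F.getline(1000).decode())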

I tested speed by accessing a sample of my large file within the first 1 million lines (out of 48 million). I ran the following to get an idea of the time it took to do 10 million accesses:

import linecache
from time import time, sleep

linecache.getline("sample.txt", 0)   # prime linecache's cache
F = fast_file("sample.txt")


sleep(1)
start = time()
for i in range(10000000):
    linecache.getline("sample.txt",1000)
print(time()-start)

>>> 6.914520740509033

sleep(1)
start = time()
for i in range(10000000):
    F.getline(1000)
print(time()-start) 

>>> 4.488042593002319

sleep(1)
start = time()
for i in range(10000000):
    F.getline(1000).decode()
print(time()-start) 

>>> 6.825756549835205

It's not that much faster, and it takes some time to initialize (longer, in fact); however, consider that my original file was too large for linecache. This simple wrapper allowed me to do random accesses to lines that linecache was unable to handle on my machine (32 GB of RAM).

I think this might now be a faster alternative to linecache (speeds may depend on I/O and RAM speeds), but if you have a way to improve this, please add a comment and I will update the solution accordingly.

Update

I recently replaced a list with a collections.deque which is faster.

Second Update

The collections.deque is faster for the append operation, but a list is faster for random access; hence the conversion here from a deque to a list optimizes both instantiation and random access times. I've added sleeps to this test, and the decode call in the comparison, because the mmap returns bytes, to make the comparison fair.