I ran into a problem similar to the one in the post above, but the solutions already posted didn't work in my particular scenario: the file was too big for linecache, and islice was nowhere near fast enough. I would like to offer a third (or fourth) alternative.
My solution is based on the fact that mmap lets us access an arbitrary point in a file. We only need to know where each line begins and ends; then the mmap can return lines about as fast as linecache. To optimize this code (see the updates below):
- We use the deque class from collections to build a dynamically sized collection of line endpoints, since deque appends are cheap.
- We then convert it to a list, which gives fast random access into that collection.
The following is a simple wrapper for the process:
from collections import deque
import mmap

class fast_file():
    def __init__(self, file):
        self.file = file
        # deque gives cheap appends while we scan the file; we convert
        # it to a list afterwards for fast random access.
        self.linepoints = deque()
        self.linepoints.append(0)
        pos = 0
        # Scan in binary mode so positions are byte offsets that match
        # the mmap (text mode can translate line endings and drift).
        with open(file, 'rb') as fp:
            while True:
                c = fp.read(1)
                if not c:
                    break
                pos += 1
                if c == b'\n':
                    # The next line starts right after this newline.
                    self.linepoints.append(pos)
        self.fp = open(self.file, 'r+b')
        self.mm = mmap.mmap(self.fp.fileno(), 0)
        # Final endpoint so the last line has an upper bound.
        self.linepoints.append(pos)
        self.linepoints = list(self.linepoints)

    def getline(self, i):
        # Slice the mapped file between consecutive offsets (returns bytes).
        return self.mm[self.linepoints[i]:self.linepoints[i + 1]]

    def close(self):
        self.fp.close()
        self.mm.close()
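For reference, usage looks like this (sample.txt is a stand-in for any text file; getline is 0-indexed and returns bytes):

F = fast_file("sample.txt")   # the line index is built once, here
print(F.getline(0))           # first line as bytes, e.g. b'first line\n'
print(F.getline(0).decode())  # decode if you need a str
F.close()                     # release the file handle and the mmap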
The caveats are that the file and the mmap need closing, and that enumerating the line endpoints takes some time, but it is a one-off cost. The result is something that is fast both to instantiate and in random access; note, however, that each line comes back as bytes.
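If the manual close bothers you, one option (a sketch I haven't benchmarked, with a hypothetical name) is to bolt context-manager support onto the class so a with block cleans up automatically:

class fast_file_ctx(fast_file):
    # Hypothetical subclass: lets fast_file be used in a 'with' block.
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()

with fast_file_ctx("sample.txt") as F:
    print(F.getline(0).decode())
# F.fp and F.mm are closed here, even if an exception was raised.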
I tested the speed by accessing a sample consisting of the first 1 million lines of my large file (48 million lines in total). I ran the following to get an idea of the time taken for 10 million accesses:
import linecache
from time import sleep, time

linecache.getline("sample.txt", 0)  # prime linecache's in-memory copy (getline is 1-indexed, so this returns '')
F = fast_file("sample.txt")         # build the line-offset index up front

sleep(1)
start = time()
for i in range(10000000):
    linecache.getline("sample.txt", 1000)
print(time() - start)
>>> 6.914520740509033

sleep(1)
start = time()
for i in range(10000000):
    F.getline(1000)
print(time() - start)
>>> 4.488042593002319

sleep(1)
start = time()
for i in range(10000000):
    F.getline(1000).decode()
print(time() - start)
>>> 6.825756549835205
It's not that much faster, and instantiation takes longer, but consider that my original file was too large for linecache in the first place. This simple wrapper allowed me to do random line accesses that linecache could not perform on my machine (32 GB of RAM).
I think this may now be a near-optimal alternative to linecache (speeds will depend on I/O and RAM), but if you have a way to improve it, please add a comment and I will update the solution accordingly.
Update
I recently replaced the list with a collections.deque, which is faster to append to.
Second Update
A collections.deque is faster for appends, but a list is faster for random access; hence, converting the deque to a list after the scan optimizes both instantiation and random access times. I've also added sleeps between the timing runs, and included a decode call in the comparison, since mmap returns bytes, to keep the comparison fair.
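For anyone who wants to verify that trade-off, here is a rough micro-benchmark sketch (timeit-based; the sizes and repeat counts are arbitrary, and the exact numbers will vary by machine):

from collections import deque
from timeit import timeit

n = 1000000

def fill(container):
    c = container()
    for i in range(n):
        c.append(i)
    return c

# Appends: deque avoids list's periodic reallocations, so it tends to win.
print("deque append:", timeit(lambda: fill(deque), number=5))
print("list append: ", timeit(lambda: fill(list), number=5))

# Random access: list indexing is O(1), but indexing into the middle of a
# deque is O(n); this is why the class converts the deque to a list.
d = fill(deque)
l = fill(list)
print("deque index:", timeit(lambda: d[n // 2], number=10000))
print("list index: ", timeit(lambda: l[n // 2], number=10000))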