I'm trying to search for some keywords in a large text file (~232 GB). I want to take advantage of buffering for speed, and I also want to record the starting positions of the lines that contain those keywords.
I've seen many posts here discussing similar questions. However, the solutions that use buffering (using the file as an iterator) cannot give the correct file position, and the solutions that do give correct file positions usually just use f.readline(), which does not use buffering.
The only answer I saw that can do both is here:
# Read in the file once and build a list of line offsets
line_offset = []
offset = 0
for line in file:
    line_offset.append(offset)
    offset += len(line)
file.seek(0)
# Now, to skip to line n (with the first line being line 0), just do
file.seek(line_offset[n])
However, I'm not sure whether the offset += len(line) operation adds unnecessary overhead. Is there a more direct way to do this?
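To make the goal concrete, here is a minimal sketch of how the running-offset approach might be applied to the keyword search. It assumes the file is opened in binary mode, so that len(line) counts bytes and the recorded offsets can be passed to seek(). The function name find_keyword_offsets and its parameters are hypothetical, chosen only for illustration.

```python
def find_keyword_offsets(path, keywords):
    """Return the byte offset of the start of every line that
    contains at least one of the given keywords.

    `keywords` must be byte strings (e.g. b"foo") because the file
    is read in binary mode. On Python 2 a plain str is already bytes.
    """
    hits = []
    offset = 0  # byte offset of the start of the current line
    with open(path, 'rb') as f:      # binary mode: len(line) is a byte count
        for line in f:               # buffered line-by-line iteration
            if any(k in line for k in keywords):
                hits.append(offset)
            offset += len(line)      # advance to the start of the next line
    return hits
```

Each returned offset can later be passed to f.seek() on a file opened in binary mode, after which f.readline() should return the matching line. Whether the extra addition per line matters at this scale is exactly the open question.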
UPDATE:
I've done some timing, and it seems that .readline() is much slower than using the file object as an iterator on Python 2.7.3. I used the following code:
#!/usr/bin/python
from timeit import timeit

MAX_LINES = 10000000

# use file object as iterator
def read_iter():
    with open('tweets.txt', 'r') as f:
        lino = 0
        for line in f:
            lino += 1
            if lino == MAX_LINES:
                break

# use .readline()
def read_readline():
    with open('tweets.txt', 'r') as f:
        lino = 0
        for line in iter(f.readline, ''):
            lino += 1
            if lino == MAX_LINES:
                break

# use offset+=len(line) to simulate f.tell() under binary mode
def read_iter_tell():
    offset = 0
    with open('tweets.txt', 'rb') as f:
        lino = 0
        for line in f:
            lino += 1
            offset += len(line)
            if lino == MAX_LINES:
                break

# use f.tell() with .readline()
def read_readline_tell():
    with open('tweets.txt', 'rb') as f:
        lino = 0
        for line in iter(f.readline, ''):
            lino += 1
            offset = f.tell()
            if lino == MAX_LINES:
                break

print ("iter: %f" % timeit("read_iter()", number=1, setup="from __main__ import read_iter"))
print ("readline: %f" % timeit("read_readline()", number=1, setup="from __main__ import read_readline"))
print ("iter_tell: %f" % timeit("read_iter_tell()", number=1, setup="from __main__ import read_iter_tell"))
print ("readline_tell: %f" % timeit("read_readline_tell()", number=1, setup="from __main__ import read_readline_tell"))
The results (in seconds) were:
iter: 5.079951
readline: 37.333189
iter_tell: 5.775822
readline_tell: 38.629598