I have a file with hundreds of thousands of records, one per line. I need to read 100, process them, read another 100, process them, and so forth. I don't want to load that many records into memory at once. How do I read 100 lines at a time (or fewer when EOF is encountered) from an open file using Python?
-
please define "record" – notorious.no Apr 08 '15 at 19:45
-
Call `readline()` 100 times... stop calling it if you hit EOF...? – John Kugelman Apr 08 '15 at 19:46
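A minimal sketch of that approach, assuming the file is already open as `f`:

batch = []
for _ in range(100):
    line = f.readline()
    if not line:          # readline() returns '' at EOF
        break
    batch.append(line)
# process batch here (it may hold fewer than 100 lines at EOF)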
-
Is there a specific reason you need to process them 100 at a time, rather than one at a time, or 64 at a time, or whatever? Is this to do with buffering, or is there something in particular about 100? – DNA Apr 08 '15 at 19:49
-
Related to : http://stackoverflow.com/questions/24716001/python-reading-in-a-text-file-in-a-set-line-range . The accepted answer seems to fit your needs, with a little tweaking so you can do the 100 first, then 100 others, etc. I don't know the impact on the memory though. – Apr 08 '15 at 19:51
-
@DNA: I need to process 100 at a time because I'm using an api with a cap on the number of calls I can make. I can obviously parameterize the value for other APIs. – Dervin Thunk Apr 08 '15 at 19:52
-
Take this! http://stackoverflow.com/questions/8009882/how-to-read-large-file-line-by-line-in-python – Paulo Abreu Apr 08 '15 at 19:56
5 Answers
`islice()` can be used to retrieve the next `n` items of an iterator.
from itertools import islice

with open(...) as file:
    while True:
        lines = list(islice(file, 100))
        for line in lines:
            ...  # do stuff
        if not lines:
            break

-
It works well. I would suggest putting the ``if not lines`` before ``do stuff`` to avoid a blank list at the end of the while loop. – Libin Wen Jun 11 '17 at 05:24
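A sketch of that rearrangement (suggested in the comment above), checking for an empty batch before processing anything:

from itertools import islice

with open('data.txt') as file:           # 'data.txt' is a placeholder filename
    while True:
        lines = list(islice(file, 100))  # next 100 lines, or fewer at EOF
        if not lines:                    # empty list means EOF was reached
            break
        for line in lines:
            ...                          # do stuff with each line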
with open('file.txt', 'r') as f:
    workset = []                 # start a work set
    for line in f:               # iterate over file
        workset.append(line)     # add current line to work set
        if len(workset) == 100:  # if 100 items in work set,
            dostuff(workset)     # send work set to processing
            workset = []         # make a new work set
    if workset:                  # if there's an unprocessed work set at the end (<100 items),
        dostuff(workset)         # process it
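Here `dostuff` stands for whatever processing the work set needs; for a quick test it could be any function that takes a list of lines, e.g. this hypothetical stub:

def dostuff(workset):
    # hypothetical placeholder: just report how many lines are in the batch
    print("processing %d lines" % len(workset))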

A runnable example using the `take` recipe from the itertools page:
from itertools import islice

# Recipe from https://docs.python.org/2/library/itertools.html
def take(n, iterable):
    "Return first n items of the iterable as a list"
    return list(islice(iterable, n))

if __name__ == "__main__":
    with open('data.txt', 'r') as f:
        while True:
            lines = take(100, f)
            if lines:
                print(lines)
            else:
                break


You could utilize `izip_longest` in the `grouper` recipe, which would also address your EOF issue:
with open("my_big_file") as f:
for chunk_100 in izip_longest(*[f] * 100)
#record my lines
Here we simply iterate over the file's lines, specifying a fixed chunk length of 100 lines.
A simple example of the `grouper` recipe (from the docs):
def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)
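Note that `izip_longest` pads the final chunk with the fill value (`None` by default) when fewer than 100 lines remain, so the padding should be stripped before processing. A sketch, assuming Python 2 (use `itertools.zip_longest` on Python 3):

from itertools import izip_longest

with open("my_big_file") as f:
    for chunk_100 in izip_longest(*[f] * 100):
        lines = [line for line in chunk_100 if line is not None]  # drop padding in the last chunk
        # process lines, a list of up to 100 lines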

file.readlines(sizehint), with sizehint given in bytes

Instead of creating your own iterator, you can use the built-in one. Python's `file.readlines()` returns a list of all the lines in the file, and if the file is too big that list won't fit in memory. So you can pass the optional `sizehint` argument: it reads roughly `sizehint` bytes (not lines) from the file, plus enough more to complete the last line, and returns the lines read from that. Only complete lines will be returned.

For example:

file.readlines(1000)

will read roughly 1000 bytes from the file and return them as a list of complete lines.
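A minimal sketch of using it in a loop (the 100000-byte hint is an arbitrary value chosen for illustration):

with open('data.txt') as f:          # 'data.txt' is a placeholder filename
    while True:
        lines = f.readlines(100000)  # roughly 100000 bytes' worth of complete lines
        if not lines:                # empty list means EOF
            break
        for line in lines:
            ...                      # process each line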
