
Possible Duplicate:
Lazy Method for Reading Big File in Python?

I need to read 100 GB (400 million lines) of data from a file, line by line. This is my current code; is there a more efficient way to do this, in terms of execution speed?

f = open(path, 'r')

for line in f: 
    ...

f.close()
Rohita Khatiwada
  • You have a single 100 GB file??? –  Apr 04 '11 at 14:17
  • This is pretty much the best way. – Rafe Kettler Apr 04 '11 at 14:18
  • Unbelievable. Obviously something is wrong in your application if it generates a 100 GB file :-) – Siva Arunachalam Apr 04 '11 at 14:19
  • A 100 GB file alone would deserve a -1, but it's happy hour :) –  Apr 04 '11 at 14:22
  • @Rest: 100 GB isn't necessarily a -1. Maybe the OP really does have that much data! (CERN is estimated to generate 40,000 GB per day.) – Katriel Apr 04 '11 at 14:40
  • @katrielalex: I don't think the CERN experiments create a single (or even a few) large files; I think they create a large number of relatively small files. Similarly, I think that creating a 100 GB file is the root-cause problem that should be fixed. Preventing a 100 GB file (through many smaller files that can be handled in parallel, or through effective summaries) is a much better solution than trying to read one immense file quickly. I think a -1 is justified since this question doesn't address the real problem. – S.Lott Apr 04 '11 at 15:42

2 Answers


If the lines are of a fixed byte length, and the lines do not have to be read in any particular order (you can still know the line number), then you can easily split the work into parallel subtasks, executed in multiple threads/processes. Each subtask only needs to know where to seek() to and how many bytes to read().

Also, in such a case it's not optimal to read line by line, since that requires scanning for \n; just use read() with a fixed length instead.
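
A minimal sketch of that idea, assuming a hypothetical fixed record length (RECORD_LEN) and using len() as a stand-in for the real per-record work; concurrent.futures is used here, but multiprocessing would do just as well:

import os
from concurrent.futures import ProcessPoolExecutor

RECORD_LEN = 128      # assumption: every line is exactly 128 bytes, newline included
NUM_WORKERS = 8       # assumption: tune to the number of available cores

def process_chunk(path, start, count):
    """Process `count` fixed-length records starting at byte offset `start`."""
    results = []
    with open(path, 'rb') as f:
        f.seek(start)
        for _ in range(count):
            record = f.read(RECORD_LEN)
            if not record:                    # ran past the end of the file
                break
            results.append(len(record))       # replace len() with the real per-record work
    return results

def process_file(path):
    total = os.path.getsize(path) // RECORD_LEN             # total number of records
    per_worker = (total + NUM_WORKERS - 1) // NUM_WORKERS   # records per subtask
    with ProcessPoolExecutor(max_workers=NUM_WORKERS) as ex:
        futures = [ex.submit(process_chunk, path, i * per_worker * RECORD_LEN, per_worker)
                   for i in range(NUM_WORKERS)]
        return [r for fut in futures for r in fut.result()]
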

vartec

If you have a multicore machine and can use Python 3.2 (instead of Python 2), this would be a good use case for the new concurrent.futures feature in Python 3.2, depending on the processing you need to do with each line. If you require the processing to be done in file order, you'd probably have to worry about reassembling the output later on (one way to do that is sketched just below).
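
If order does matter, one possible way to keep the results in file order when scheduling each line with submit() is to keep the futures in a list; my_function below is just a stand-in for whatever per-line processing you need, and the map() call shown further below already returns results in input order:

from concurrent.futures import ProcessPoolExecutor as Executor

def my_function(line):
    return len(line)      # stand-in for the real per-line processing

with Executor(max_workers=5) as ex:
    with open("poeem_5.txt") as fl:
        # the futures are kept in submission order, so iterating the list
        # gives the results back in file order even if tasks finish out of order
        futures = [ex.submit(my_function, line) for line in fl]
        results = [f.result() for f in futures]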

Otherwise, using concurrent.futures you can schedule each line to be processed as a different task with little effort. What output do you have to generate from this?

If you think you would not profit from parallelizing each line's processing, then the most obvious way is the best way: that is, what you have just done.

This example divides the processing among up to 5 sub-processes (the max_workers argument below), each one executing Python's built-in len function. Replace len with a function that receives the line as a parameter and performs whatever processing you need on that line:

from concurrent.futures import ProcessPoolExecutor as Executor

with Executor(max_workers=5) as ex:
    with open("poeem_5.txt") as fl:
        results = list(ex.map(len, fl))

The "list" call is needed to force the mapping to be done within the "with" statement. If you don't need a scalar value for each line, but rather to record a result in a file you can do it in a for loop instead:

for line in fl:
    ex.submit(my_function, line)
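
For instance, a sketch of collecting the futures returned by submit() and recording each result in an output file as the tasks finish; results.txt and my_function are placeholder names, and note that as_completed yields results out of file order:

from concurrent.futures import ProcessPoolExecutor as Executor, as_completed

def my_function(line):
    return len(line)          # placeholder for the real per-line processing

with Executor(max_workers=5) as ex:
    with open("poeem_5.txt") as fl, open("results.txt", "w") as out:
        futures = [ex.submit(my_function, line) for line in fl]
        # as_completed yields each future as soon as it finishes, so results
        # are written as they become available, not in file order
        for future in as_completed(futures):
            out.write("%s\n" % future.result())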
jsbueno