1

What is the most efficient (fastest) way to simultaneously read in two large files and do some processing?

I have two files, a.txt and b.txt, each containing about a hundred thousand corresponding lines. My goal is to read in the two files and then do some processing on each line pair:

def kernel():
    a_file = open('a.txt', 'r')
    b_file = open('b.txt', 'r')
    a_line = a_file.readline()
    b_line = b_file.readline()
    while a_line:
        process(a_line, b_line)  # processing that requires both corresponding file lines
        a_line = a_file.readline()
        b_line = b_file.readline()

I looked into xreadlines and readlines, but I'm wondering if I can do better. Speed is of paramount importance for this task.

Thank you.

Duke
  • Python isn't great for speed. C or C++ is recommended. Try: http://stackoverflow.com/questions/5164538/how-can-i-speed-up-line-by-line-reading-of-an-ascii-file-c – Alvin K. Nov 08 '11 at 04:07
  • @Alvin K.: Profile first: Python is still fast enough to be able to outstrip *most* forms of I/O, especially if said I/O is hitting disk or network. – Thanatos Nov 09 '11 at 04:13
  • @Thanatos: profiling is also mentioned in the link above, which claims that I/O isn't the main bottleneck. Thanks for highlighting it. – Alvin K. Nov 12 '11 at 02:57

4 Answers

2

The code below does not accumulate data from the input files in memory, unless the process function does that by itself.

from itertools import izip

def process(line1, line2):
  pass  # process a line from each input

with open('a.txt', 'r') as f1:
  with open('b.txt', 'r') as f2:
    for a, b in izip(f1, f2):
      process(a, b)

If the process function is efficient, this code should run quickly enough for most purposes. The for loop will terminate when the end of one of the files is reached. If either file contains an extraordinarily long line (e.g. a single huge XML or JSON document on one line), or if the files are not text, this code may not work well.
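If the two inputs are not guaranteed to have the same number of lines and silently stopping at the shorter one is unacceptable, a variant of the loop above based on itertools.izip_longest could flag the mismatch. This is just a sketch, not part of the original answer; it reuses the process function above, and the choice of exception is arbitrary:

from itertools import izip_longest

with open('a.txt', 'r') as f1:
  with open('b.txt', 'r') as f2:
    for a, b in izip_longest(f1, f2):
      # izip_longest pads the shorter file with None
      if a is None or b is None:
        raise ValueError('input files have different numbers of lines')
      process(a, b)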

wberry
1

You can use the with statement to make sure your files are closed after execution. From this blog entry:

To open a file, process its contents, and make sure to close it, you can simply do:

with open("x.txt") as f:
    data = f.read()
    do something with data
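Applied to the two files in the question, this could look like the following. This is only a sketch, assuming Python 2.7+ (where a single with statement can manage both files) and a process function like the one in the question:

from itertools import izip

with open('a.txt') as a_file, open('b.txt') as b_file:
    for a_line, b_line in izip(a_file, b_file):
        process(a_line, b_line)
    # both files are closed automatically when the block exits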
Tadeck
1

String IO can be pretty fast -- probably your processing will be what slows things down. Consider a simple input loop to feed a queue like:

import itertools
import multiprocessing

queue = multiprocessing.Queue(100)
a_file = open('a.txt')
b_file = open('b.txt')
for pair in itertools.izip(a_file, b_file):
    queue.put(pair)  # blocks here on a full queue

You can set up a pool of processes pulling items from the queue and taking action on each, assuming your problem can be parallelised this way.
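A minimal sketch of that pattern, combining the feeding loop above with a small set of worker processes. The worker count, the sentinel-based shutdown, and the body of process are assumptions here, not part of the original answer:

import itertools
import multiprocessing

def process(a_line, b_line):
    pass  # the real per-pair work goes here

def worker(queue):
    # pull pairs until the None sentinel is seen, then exit
    for pair in iter(queue.get, None):
        process(*pair)

if __name__ == '__main__':
    queue = multiprocessing.Queue(100)
    workers = [multiprocessing.Process(target=worker, args=(queue,))
               for _ in range(4)]
    for w in workers:
        w.start()

    a_file = open('a.txt')
    b_file = open('b.txt')
    for pair in itertools.izip(a_file, b_file):
        queue.put(pair)  # blocks here on a full queue

    for _ in workers:
        queue.put(None)  # one sentinel per worker
    for w in workers:
        w.join()

Whether this pays off depends on whether process is heavy enough to offset the cost of pickling each pair through the queue.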

Lars Yencken
    In Python 2.x, be sure to use *itertools.izip()* so that the zip step doesn't happen all at once and pull both files completely into memory. – Raymond Hettinger Nov 08 '11 at 04:35
  • Maybe so. I'm reading in a list of tokens at each step and storing it in a list; maybe that's the bottleneck. The program is hogging memory. – Duke Nov 08 '11 at 13:44
  • Thanks Raymond for picking up that glaring mistake. Absolutely use izip in this scenario, not zip. I've fixed the example. – Lars Yencken Nov 09 '11 at 04:10
  • Thanks for the advice. Got it to work using izip, which also resolved my memory issue. Did some optimizations on the algorithm and reduced the space consumption. – Duke Nov 11 '11 at 20:59
0

I'd change your while condition to the following so that it doesn't fail when a has more lines than b.

while a_line and b_line:

Otherwise, that looks good. You are reading in the two lines that you need, then processing. You could even multithread this by reading in N pairs of lines and sending each batch off to a new thread or similar.
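A rough sketch of that batching idea using the standard threading module. The batch size, helper names, and use of izip here are illustrative assumptions, not part of the original answer:

import itertools
import threading

def process(a_line, b_line):
    pass  # the real per-pair work goes here

def process_batch(batch):
    # handle one batch of up to N line pairs in its own thread
    for a_line, b_line in batch:
        process(a_line, b_line)

N = 1000  # illustrative batch size
threads = []
with open('a.txt') as a_file, open('b.txt') as b_file:
    pairs = itertools.izip(a_file, b_file)
    while True:
        batch = list(itertools.islice(pairs, N))
        if not batch:
            break
        t = threading.Thread(target=process_batch, args=(batch,))
        t.start()
        threads.append(t)

for t in threads:
    t.join()

Keep in mind that under CPython's GIL, threads only help if process spends its time in I/O or in C code that releases the GIL; otherwise a multiprocessing-based approach like the one in another answer is more likely to pay off.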

ObscureRobot