1

What is the most efficient (fastest) way to simultaneously read in two large files and do some processing?

I have two files, a.txt and b.txt, each containing about a hundred thousand corresponding lines. My goal is to read in the two files and then do some processing on each line pair:

def kernel():
    a_file = open('a.txt', 'r')
    b_file = open('b.txt', 'r')
    a_line = a_file.readline()
    b_line = b_file.readline()
    while a_line:
        process(a_line, b_line)  # processing that requires both corresponding file lines
        a_line = a_file.readline()
        b_line = b_file.readline()

I looked into xreadlines and readlines, but I'm wondering if I can do better. Speed is of paramount importance for this task.

Thank you.

Duke
  • Python isn't great for speed. C or C++ is recommended. Try: http://stackoverflow.com/questions/5164538/how-can-i-speed-up-line-by-line-reading-of-an-ascii-file-c – Alvin K. Nov 08 '11 at 04:07
  • @Alvin K.: Profile first: Python is still fast enough to be able to outstrip *most* forms of I/O, especially if said I/O is hitting disk or network. – Thanatos Nov 09 '11 at 04:13
  • @Thanatos: profiling is also mentioned in the link above, which claims that I/O isn't the main bottleneck. Thanks for highlighting it. – Alvin K. Nov 12 '11 at 02:57

4 Answers

2

The code below does not accumulate data from the input files in memory, unless the process function does that by itself.

from itertools import izip

def process(line1, line2):
  pass  # process a line from each input

with open('a.txt', 'r') as f1:
  with open('b.txt', 'r') as f2:
    for a, b in izip(f1, f2):
      process(a, b)

If the process function is efficient, this code should run quickly enough for most purposes. The for loop will terminate when the end of one of the files is reached. If either file contains an extraordinarily long line (e.g. a single huge XML or JSON document on one line), or if the files are not text, this code may not work well.
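If the two inputs are not guaranteed to have the same number of lines and silently stopping at the shorter one is unacceptable, a variant of the loop above based on itertools.izip_longest could flag the mismatch. This is just a sketch, not part of the original answer; it reuses the process function above, and the choice of exception is arbitrary:

from itertools import izip_longest

with open('a.txt', 'r') as f1:
  with open('b.txt', 'r') as f2:
    for a, b in izip_longest(f1, f2):
      # izip_longest pads the shorter file with None
      if a is None or b is None:
        raise ValueError('input files have different numbers of lines')
      process(a, b)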

wberry
1

You can use the with statement to make sure your files are closed after execution. From this blog entry:

To open a file, process its contents, and make sure to close it, you can simply do:

with open("x.txt") as f:
    data = f.read()
    do something with data
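Applied to the two files in the question, this could look like the following. This is only a sketch, assuming Python 2.7+ (where a single with statement can manage both files) and a process function like the one in the question:

from itertools import izip

with open('a.txt') as a_file, open('b.txt') as b_file:
    for a_line, b_line in izip(a_file, b_file):
        process(a_line, b_line)
    # both files are closed automatically when the block exits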
Tadeck
1

String IO can be pretty fast -- probably your processing will be what slows things down. Consider a simple input loop to feed a queue like:

import itertools
import multiprocessing

queue = multiprocessing.Queue(100)
a_file = open('a.txt')
b_file = open('b.txt')
for pair in itertools.izip(a_file, b_file):
    queue.put(pair)  # blocks here on a full queue

You can set up a pool of processes pulling items from the queue and taking action on each, assuming your problem can be parallelised this way.
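A minimal sketch of that pattern, combining the feeding loop above with a small set of worker processes. The worker count, the sentinel-based shutdown, and the body of process are assumptions here, not part of the original answer:

import itertools
import multiprocessing

def process(a_line, b_line):
    pass  # the real per-pair work goes here

def worker(queue):
    # pull pairs until the None sentinel is seen, then exit
    for pair in iter(queue.get, None):
        process(*pair)

if __name__ == '__main__':
    queue = multiprocessing.Queue(100)
    workers = [multiprocessing.Process(target=worker, args=(queue,))
               for _ in range(4)]
    for w in workers:
        w.start()

    a_file = open('a.txt')
    b_file = open('b.txt')
    for pair in itertools.izip(a_file, b_file):
        queue.put(pair)  # blocks here on a full queue

    for _ in workers:
        queue.put(None)  # one sentinel per worker
    for w in workers:
        w.join()

Whether this pays off depends on whether process is heavy enough to offset the cost of pickling each pair through the queue.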

Lars Yencken
    In Python 2.x, be sure to use *itertools.izip()* so that the zip step doesn't happen all at once and pull both files completely into memory. – Raymond Hettinger Nov 08 '11 at 04:35
  • Maybe so. I'm reading in a list of tokens at each step and storing it in a list; maybe that's the bottleneck. The program is hogging memory. – Duke Nov 08 '11 at 13:44
  • Thanks Raymond for picking up that glaring mistake. Absolutely use izip in this scenario, not zip. I've fixed the example. – Lars Yencken Nov 09 '11 at 04:10
  • Thanks for the advice. Got it to work using izip, which also resolved my memory issue. Did some optimizations on the algorithm and reduced the space consumption. – Duke Nov 11 '11 at 20:59
0

I'd change your while condition to the following so that it doesn't fail when a has more lines than b.

while a_line and b_line:

Otherwise, that looks good. You are reading in the two lines that you need, then processing. You could even multithread this by reading in N pairs of lines and sending each batch off to a new thread or similar.
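A rough sketch of that batching idea using the standard threading module. The batch size, helper names, and use of izip here are illustrative assumptions, not part of the original answer:

import itertools
import threading

def process(a_line, b_line):
    pass  # the real per-pair work goes here

def process_batch(batch):
    # handle one batch of up to N line pairs in its own thread
    for a_line, b_line in batch:
        process(a_line, b_line)

N = 1000  # illustrative batch size
threads = []
with open('a.txt') as a_file, open('b.txt') as b_file:
    pairs = itertools.izip(a_file, b_file)
    while True:
        batch = list(itertools.islice(pairs, N))
        if not batch:
            break
        t = threading.Thread(target=process_batch, args=(batch,))
        t.start()
        threads.append(t)

for t in threads:
    t.join()

Keep in mind that under CPython's GIL, threads only help if process spends its time in I/O or in C code that releases the GIL; otherwise a multiprocessing-based approach like the one in another answer is more likely to pay off.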

ObscureRobot