
I was trying to compare two large text files (10 GB each) line by line without loading the entire files into memory. I used the following code, as suggested in other threads:

with open(in_file1,"r") as f1, open(in_file2,"r") as f2:
    for (line1, line2) in zip(f1, f2):
        compare(line1, line2)

But it seems that Python fails to read the files line by line: I observed that memory usage while running the code was > 20 GB. I also tried using:

import fileinput
for (line1, line2) in zip(fileinput.input([in_file1]),fileinput.input([in_file2])):
    compare(line1, line2)

This one also tries to load everything into memory. I'm using Python 2.7.4 on CentOS 5.9, and I don't store any of the lines in my code.

What is going wrong in my code? How should I change it to avoid loading everything into RAM?

Ken Ma

1 Answer


In Python 2, the built-in zip function returns a list of tuples, so it reads both files completely in order to build that list. Use itertools.izip instead: it returns an iterator of tuples and pulls just one line from each file at a time.

from itertools import izip  # lazy pairwise iteration (Python 2)

with open(in_file1, "r") as f1, open(in_file2, "r") as f2:
    # izip yields one pair of lines at a time, so only the current
    # line from each file needs to be held in memory.
    for line1, line2 in izip(f1, f2):
        compare(line1, line2)
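On Python 3, the original code would already behave lazily, because the built-in zip returns an iterator (itertools.izip was removed for that reason). A minimal sketch demonstrating this laziness, using a generator as a stand-in for a file object and a `reads` list (an illustrative counter, not part of any API) to count how many lines are actually consumed:

```python
def fake_file(prefix, total, counter):
    """Stand-in for a file object: yields lines lazily, recording each read."""
    for i in range(total):
        counter.append(prefix)
        yield "%s line %d\n" % (prefix, i)

reads = []
# Python 3's zip (like itertools.izip on Python 2) is lazy:
# advancing it pulls exactly one line from each source.
paired = zip(fake_file("a", 10**6, reads), fake_file("b", 10**6, reads))
first_pair = next(paired)
# Only one line has been read from each "file", not a million.
```

Memory usage therefore stays constant regardless of file size, since no list of all line pairs is ever built.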
Thomas B.