I need to merge-sort text files that are about 150 MB each and together amount to about 5 GB.

The problem is that I can't do a merge sort using readlines(), since the last step would need to load 5 GB into memory, and with only the

for line1 in file1, line2 in file2:
    while( line1 & line2 )...

command, I can't tell Python to get only the next line of file 1 while keeping the current line of file 2, so I'm unable to do a merge sort.

I read something about setting the read buffer really low on readlines() so that only a single line is loaded into memory, but then I can't delete the first line from the file.

Is there any other memory-efficient way to get only the first line of a file and delete it, or is there already a function available somewhere to merge-sort two text files?

  • take a look at http://docs.python.org/2/library/fileinput.html – corn3lius Feb 20 '14 at 14:37
  • Would this help at all: http://stackoverflow.com/questions/2064184/remove-lines-from-textfile-with-python Seeing how you don't need to load the contents of the 5GB file, you only need to read in the smaller ones, loop through each of them, and append them (yes, `with open(file, 'a') as fh:`) to the large file – Torxed Feb 20 '14 at 14:38

1 Answer

command, I can't tell Python to get only the next line of file 1 while keeping the current line of file 2, so I'm unable to do a merge sort

Yes, you can. Call readline() on each file separately and only advance the file whose line you just wrote:

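# file1 and file2 are sorted inputs opened for reading; file3 is opened for writing
line1 = file1.readline()
line2 = file2.readline()
# readline() returns "" at end of file, so both lines stay truthy until then
while line1 and line2:
    if line1 < line2:
        file3.write(line1)
        line1 = file1.readline()
    else:
        file3.write(line2)
        line2 = file2.readline()

# one input is exhausted: copy whatever is left of the other into file 3
while line1:
    file3.write(line1)
    line1 = file1.readline()
while line2:
    file3.write(line2)
    line2 = file2.readline()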
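
As for an existing function: the standard library's heapq.merge does the same line-by-line merge for any number of sorted inputs, keeping only one pending line per file in memory. Below is a minimal sketch, assuming Python 3, that every input file is already sorted on its own, and that every line (including the last) ends with a newline. The function name merge_sorted_files and its parameters are made up for illustration.

```python
import heapq
from contextlib import ExitStack


def merge_sorted_files(input_paths, output_path):
    """Merge already-sorted text files into one sorted file, line by line.

    Hypothetical helper: assumes each input is sorted and that every
    line ends with a newline character.
    """
    with ExitStack() as stack:
        # Open every input; ExitStack closes them all when the block exits.
        inputs = [stack.enter_context(open(path)) for path in input_paths]
        output = stack.enter_context(open(output_path, "w"))
        # heapq.merge is lazy: it pulls one line at a time from each file
        # and yields them in sorted order, so memory use stays small.
        output.writelines(heapq.merge(*inputs))
```

Since each 150 MB file fits in memory by itself, one option is to sort each file individually first (for example with sorted() on its lines), write it back, and then call this once on the full list of paths, instead of merging the files pairwise.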