I have many huge (>40 GB) text files that contain duplicate lines, both within a single file and across files. I need to merge these files under Windows into one big file without duplicate lines.

I want to do this in Python because of its fast hash table.

As I understand it, I need to:

  1. Sort each file
  2. Open each file and read it line by line until the end

Something like:

sort(file1)
sort(file2)

openRead(file1)
openRead(file2)
openWrite(file3)
string previousLine = ""
string line1 = read(file1)
string line2 = read(file2)

do {
    if (line1 > line2) {
       if (previousLine != line2) {
           write(line2, file3)
           previousLine = line2
       }
       line2 = read(file2)
    } else {
        if (line2 > line1) {
            if (previousLine != line1) {
                write(line1, file3)
                previousLine = line1
            }
            line1 = read(file1)
        } else {
            if (previousLine != line1) {
                write(line1, file3)
                previousLine = line1
            }
            line1 = read(file1)
            line2 = read(file2)
        }
    }
} while (!Eof(file1) && !Eof(file2))

readBiggerFileToEndAndWriteLinesTo(file3)

Is this idea correct? Or does Python offer a faster solution? (I have only 32 GB of memory.) How can I write this solution in Python?

Anthon
user809808

1 Answer

Assuming that the merged result fits into memory (which may be the case, since the files contain duplicated lines), you can create a set and add every line you read to it: a set never holds the same string twice. To read a file line by line you can do:

lines = set()
with open(...) as f:
    for line in f:
        # add the line to the set; a line that is already present is ignored
        lines.add(line)

There is no need to worry about reading large files here, because iterating over the file object reads one line at a time (further reading: How to read large file, line by line in python).
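To make the set-based idea concrete, here is a minimal sketch. The function name `merge_unique_in_memory` and its parameters are made up for illustration, and it assumes that all distinct lines fit into RAM at once. Files are opened in binary mode so that no decoding is needed and Windows line endings are handled explicitly.

```python
def merge_unique_in_memory(input_paths, output_path):
    """Write each distinct line of the inputs to output_path, in first-seen order.

    Hypothetical helper: only usable when all distinct lines fit into memory.
    """
    seen = set()
    with open(output_path, "wb") as out:
        for path in input_paths:
            with open(path, "rb") as src:
                for line in src:
                    # Strip the line ending so "abc\r\n", "abc\n" and a final
                    # "abc" without a newline all count as the same line.
                    key = line.rstrip(b"\r\n")
                    if key not in seen:
                        seen.add(key)
                        out.write(key + b"\n")
    return len(seen)
```

Each line is written the first time it is seen, so the output keeps the original order and no sorting is required. Note that a Python set costs noticeably more memory than the raw text it stores, so the limit is reached well before the distinct lines add up to 32 GB.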

If the data doesn't fit into memory (that is, it needs more than 32 GB, or in practice somewhat less than that), you need to split the whole process into chunks.
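One possible way to do the chunking is an external merge sort, which is close to the sort-then-merge idea in the question. The sketch below is an illustration under assumptions, not a tuned implementation: the names `merge_unique_external`, `_read_keys`, `_write_sorted_runs` and the parameter `lines_per_run` are invented here. Each input is cut into chunks that fit into memory, every chunk is sorted and written to a temporary "run" file, and the runs are then streamed through `heapq.merge`, with `itertools.groupby` dropping equal neighbouring lines.

```python
import heapq
import os
import tempfile
from itertools import groupby, islice


def _read_keys(path):
    """Yield each line of a file as bytes, without its line ending."""
    with open(path, "rb") as src:
        for line in src:
            yield line.rstrip(b"\r\n")


def _write_sorted_runs(input_paths, lines_per_run, tmp_dir):
    """Split the inputs into sorted, de-duplicated run files; return their paths."""
    run_paths = []
    for path in input_paths:
        keys = _read_keys(path)
        while True:
            # Take the next chunk, drop duplicates inside it, and sort it.
            chunk = sorted(set(islice(keys, lines_per_run)))
            if not chunk:
                break
            fd, run_path = tempfile.mkstemp(dir=tmp_dir, suffix=".run")
            with os.fdopen(fd, "wb") as run:
                run.writelines(key + b"\n" for key in chunk)
            run_paths.append(run_path)
    return run_paths


def merge_unique_external(input_paths, output_path,
                          lines_per_run=1_000_000, tmp_dir=None):
    """Write the sorted, distinct lines of all inputs to output_path."""
    run_paths = _write_sorted_runs(input_paths, lines_per_run, tmp_dir)
    runs = [_read_keys(p) for p in run_paths]
    try:
        with open(output_path, "wb") as out:
            # heapq.merge streams the sorted runs in order, so equal lines
            # arrive next to each other and groupby keeps one of each.
            for key, _group in groupby(heapq.merge(*runs)):
                out.write(key + b"\n")
    finally:
        for run in runs:
            run.close()  # closes the underlying file even after an error
        for p in run_paths:
            os.remove(p)
```

A call could look like `merge_unique_external(["file1.txt", "file2.txt"], "file3.txt")`, where the file names are placeholders. The output is sorted by raw byte value rather than kept in the original order. `lines_per_run` should be chosen so that one chunk comfortably fits into memory, and the temporary directory needs roughly as much free disk space as the inputs. With very many run files, the number of simultaneously open files may hit an operating system limit, in which case the runs would have to be merged in several passes.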

Community
Daniele Pantaleone