I have many huge (>40 GB) text files that contain duplicate lines, both within a single file and across files. I need to merge these files under Windows into one big file without duplicate lines.
I want to do this in Python, because of its fast hash tables.
As I understand it, I need to:
- Sort each file (see the external-sort sketch below)
- Open each file and read it line by line until the end
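The sort step is the hard part at this scale: a >40 GB file cannot be sorted in 32 GB of RAM with a plain in-memory sort, so it has to be an external merge sort: sort chunks that fit in memory, spill each sorted run to a temporary file, then k-way merge the runs. Here is a minimal sketch of that idea (the function name, chunk size, and encoding are my own assumptions, so tune them to your data):

import heapq
import itertools
import os
import tempfile

def external_sort(in_path, out_path, chunk_lines=5_000_000):
    # Pass 1: sort fixed-size chunks in memory and spill each sorted
    # run to a temporary file. chunk_lines is a guess -- pick a value
    # so that one chunk fits comfortably in RAM.
    run_paths = []
    with open(in_path, encoding="utf-8") as src:
        while True:
            chunk = list(itertools.islice(src, chunk_lines))
            if not chunk:
                break
            chunk.sort()
            fd, run_path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w", encoding="utf-8") as run:
                run.writelines(chunk)
            run_paths.append(run_path)
    # Pass 2: k-way merge the sorted runs into one sorted file.
    runs = [open(p, encoding="utf-8") for p in run_paths]
    try:
        with open(out_path, "w", encoding="utf-8") as dst:
            dst.writelines(heapq.merge(*runs))
    finally:
        for f in runs:
            f.close()
        for p in run_paths:
            os.remove(p)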
For the merge step itself, something like:
sort(file1)
sort(file2)
openRead(file1)
openRead(file2)
openWrite(file3)
string previousLine = ""
string line1 = read(file1)
string line2 = read(file2)
do {
    if (line1 > line2) {
        if (previousLine != line2) {
            write(line2, file3)
            previousLine = line2
        }
        line2 = read(file2)
    } else if (line2 > line1) {
        if (previousLine != line1) {
            write(line1, file3)
            previousLine = line1
        }
        line1 = read(file1)
    } else {
        if (previousLine != line1) {
            write(line1, file3)
            previousLine = line1
        }
        line1 = read(file1)
        line2 = read(file2)
    }
} while (!Eof(file1) && !Eof(file2))
copyRemainderSkippingDuplicates(file3)  // drain whichever file is not yet at EOF (not necessarily the bigger one)
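In Python the two-way merge does not need to be hand-coded: heapq.merge lazily merges any number of sorted iterables, and because the inputs are sorted, duplicate lines come out adjacent, so a single comparison with the previous line removes them. A minimal sketch along the lines of the pseudocode above (the file names are placeholders):

import heapq

def merge_dedup(path1, path2, out_path):
    # Lazily merge two sorted files and drop consecutive duplicates.
    # Assumes every line ends with '\n' (pad the last line if needed).
    with open(path1, encoding="utf-8") as f1, \
         open(path2, encoding="utf-8") as f2, \
         open(out_path, "w", encoding="utf-8") as out:
        previous = None
        for line in heapq.merge(f1, f2):  # lazy, constant memory
            if line != previous:
                out.write(line)
                previous = line

merge_dedup("file1_sorted.txt", "file2_sorted.txt", "merged.txt")

Since heapq.merge is lazy and files are iterated line by line, memory use stays constant regardless of file size, and no hash table is needed; the same call also merges more than two files at once (heapq.merge(f1, f2, f3, ...)).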
Is this idea correct? Or does Python offer a faster solution? (I have only 32 GB of memory.) How can I write this in Python?