I have hundreds of text files with data in the form of (id1, id2, value):
1050 20482 25
9582 92883 48
2750275 28032 3
The data within each text file is sorted by id1, then id2. The fields are tab-delimited, with '\n' at the end of each line. The data set is large: each file has 500,000 lines, and there are hundreds of files, so they can't all be merged in memory. [I generate the text data files myself, so I can change the format or the number of lines per file if that would make things easier.]
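For concreteness, each tab-delimited line splits cleanly into three integers; this little snippet is just my illustration of the format described above, not code from my merge script:

>>> line = '1050\t20482\t25\n'
>>> tuple(int(x) for x in line.split())   # whitespace split also drops the trailing '\n'
(1050, 20482, 25)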
heapq.merge() accepts a key argument in Python 3.5, but as far as I know there is no equivalent in Python 2.7.
In Python 2.7, what is an efficient way to merge these hundreds of sorted files into one big text file in the same (id1 id2 value) format, with the data sorted by id1, then by id2?
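For reference, this is roughly the Python 3.5 call I have in mind; the lambda key is just my sketch of parsing the tab-delimited fields, and infiles is the same list of filenames used in the code below:

import heapq
files = [open(fn) for fn in infiles]
# key= (and reverse=) only exist on heapq.merge() in Python 3.5+
merged = heapq.merge(*files, key=lambda line: tuple(int(x) for x in line.split()))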
Note: heapq.merge compares the lines as plain strings, so the numeric IDs come out in lexicographic rather than numeric order; I had to force them to be interpreted as integers. To get around that, I use itertools.imap, per this answer:
import contextlib, heapq, itertools

infiles = ['file1.txt', 'file2.txt', ..., 'file592.txt']
files = [open(fn) for fn in infiles]
with contextlib.nested(*files):
    with open('Results.txt', 'w') as f:
        #f.writelines(heapq.merge(*files))  <<<--- the standard way, but it compares the raw lines as strings
        # parse each line into a tuple of ints <<<--- forces the ids to be sorted as integers
        rows = (itertools.imap(lambda line: tuple(int(x) for x in line.split()), f_in) for f_in in files)
        for row in heapq.merge(*rows):
            f.write('\t'.join(map(str, row)) + '\n')
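The reason the commented-out "standard way" isn't enough on its own: comparing the raw lines as strings orders the IDs lexicographically, which is wrong whenever the IDs have different lengths, e.g. with two of the sample IDs above:

>>> '2750275' < '9582'   # string comparison: '2' sorts before '9'
True
>>> 2750275 < 9582       # numeric comparison
False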