I'm working with files that have ~400 million rows and 3 columns. The first two columns are alpha-numeric strings while the last is numeric. Something like this:
NM_001011874.1,NM_011441.4,-0.131672299779153
I've multiple such files with approximately the same number of rows and 3 columns. These aren't necessarily sorted based on any of the 3 columns. I'm trying to combine these files based on the combination of the first two columns. For instance:
File 1
NM_001011874.1,XR_104537.1,-0.929524370261122
NM_001011874.1,XM_003084433.1,-0.603098764428879
File 2
NM_001011874.1,XR_104537.1,-0.11254525414
NM_001011874.1,NM_005040.1,-0.20509876488
File 3
NM_001011874.1,XR_104537.1,-0.41254525414
NM_001011874.1,NM_005040.1,-0.60509876488
What I'm trying to do is create a key by using the combination of values in the first two columns and then retrieving the corresponding value from the third column for that pair. The final output I get is:
Output2
NM_001011874.1,XR_104537.1,-0.11254525414,-0.929524370261122,-0.41254525414
NM_001011874.1,NM_005040.1,-0.20509876488,,-0.60509876488
I'm using awk to do the above:
awk -F',' 'NR==FNR{a[$1,$2]=$3;next}{$4=a[$1,$2];print}' OFS=',' file1.txt file2.txt
I'm allocating 256GB for the task. It takes about 90mins using the above command for producing the output by combining through two files where each file has ~400 million rows and 3 columns. The output file again has ~400 millions rows but 4 columns. Time taken to produce the output file increases with every added column.
I'm doing it sequentially, i.e. merge file1 and file2 to produce output1 with 4 columns. Then merge file3 and output1 to produce output2 with 5 columns, then file4 and output2 to produce output3 with 6 columns and so on until I've the final output with 22 columns.
I wonder if it would be more efficient in terms of speed and automation to do this in Python or Perl? I've about 20 such files with 3 columns each, although the rows vary from ~100 million to ~400 million. If you think I would be better off doing this in Python or Perl, can you please share an example to illustrate how the awk script translates to Python or Perl.
Edit: Added File 3 and the final output after that based on the comments.