2

I have two .txt files. File1.txt and File2.txt. How can i merge these two files in unix (may be with awk) based on one common column.

File1.txt looks like

Sub_ID  Sam_ID   v1              
1878372 2253734 SAMN06396112           
1883177 2264293 SAMN06414028          
1884646 2275341 SAMN06432785         
1860945 2277481 SAMN06407597  

File2.txt looks like

Sam_ID  code    V3      V4 
2253734 20481   NA      DNA   
2275341 20483   NA      DNA    
2277481 20488   NA      DNA   

Final output file after merging should look like this

Finalfile.txt

Sub_ID  Sam_ID   v1               code   V3      V4
1878372 2253734 SAMN06396112     20481   NA      DNA                
1884646 2275341 SAMN06432785     20483   NA      DNA     
1860945 2277481 SAMN06407597     20488   NA      DNA  

I have tried Join so far but may be i am not fully understanding the command (being new to unix).

sort -k2b File1.txt >sorted_file1.txt 
sort File2.txt >sorted_file2.txt 
join -1 2 sorted_file1.txt sorted_file2.txt > Finalfile.txt

I understand that by k2b i am nominating the second column of File_1 to be common among two and then merging.

anubhava
  • 761,203
  • 64
  • 569
  • 643
Aryh
  • 479
  • 1
  • 4
  • 16

3 Answers3

3

Thanks for adding your own attempts to solve the problem - it makes troubleshooting a lot easier.

This answer is a bit convoluted, but here is a potential solution (GNU join):

join -t $'\t' -1 2 -2 1 <(head -n 1 File1.txt && tail -n +2 File1.txt | sort -k2,2 ) <(head -n 1 File2.txt && tail -n +2 File2.txt | sort -k1,1)

#Sam_ID Sub_ID  v1  code    V3  V4
#2253734    1878372 SAMN06396112    20481   NA  DNA
#2275341    1884646 SAMN06432785    20483   NA  DNA
#2277481    1860945 SAMN06407597    20488   NA  DNA

Explanation:

  • join uses a single character as a separator, so you can't use "\t", but you can use $'\t' (as far as I know)
  • the -1 2 and -2 1 means "for the first file, use the second field" and "for the second file, use the first field" when combining the files
  • in each subprocess (<()), sort the file by the Sam_ID column but exclude the header from the sort (per Is there a way to ignore header lines in a UNIX sort?)

Edit

To specify the order of the columns in the output (to put the Sub_ID before the Sam_ID), you can use the -o option, e.g.

join -t $'\t' -1 2 -2 1 -o 1.1,1.2,1.3,2.2,2.3,2.4 <(head -n 1 File1.txt && tail -n +2 File1.txt | sort -k2,2 ) <(head -n 1 File2.txt && tail -n +2 File2.txt | sort -k1,1)

#Sub_ID Sam_ID  v1  code    V3  V4
#1878372    2253734 SAMN06396112    20481   NA  DNA
#1884646    2275341 SAMN06432785    20483   NA  DNA
#1860945    2277481 SAMN06407597    20488   NA  DNA
jared_mamrot
  • 22,354
  • 4
  • 21
  • 46
  • Hi Jared: Thanks but the only problem i am facing with my attempt is the headers are removed. Is there anyway to avoid that? – Aryh Jun 18 '21 at 05:12
  • In my example the headers from each file are retained. Are you sure they are separated by a single tab? – jared_mamrot Jun 18 '21 at 05:15
2

This single awk should work file for merge task:

awk 'FNR==NR {map[$2] = $0; next} {print map[$1], $0}' File1.txt File2.txt | column -t

Sub_ID   Sam_ID   v1            Sam_ID   code   V3  V4
1878372  2253734  SAMN06396112  2253734  20481  NA  DNA
1884646  2275341  SAMN06432785  2275341  20483  NA  DNA
1860945  2277481  SAMN06407597  2277481  20488  NA  DNA

column -t has been used for tabular output only.

PS: If first column from File2 is not found in File1 then you will get a leading space in output.

anubhava
  • 761,203
  • 64
  • 569
  • 643
1

With cut grep and pr and ProcSub from bash.

pr -mt <(grep -Fv 1883177 file1.txt) <(cut -d' ' -f2- file2.txt)

Output

Sub_ID  Sam_ID   v1                  code    V3      V4
1878372 2253734 SAMN06396112        20481   NA      DNA
1884646 2275341 SAMN06432785        20483   NA      DNA
1860945 2277481 SAMN06407597        20488   NA      DNA

Or paste

paste -d' ' <(grep -Fv 1883177 file1.txt) <(cut -d' ' -f2- file2.txt) | column -t > Finalfile.txt
Jetchisel
  • 7,493
  • 2
  • 19
  • 18