2

I try to merge two tables. Column7 in TableA has corresponding entries in Column 1 in TableB. I want to extract the values from Column 3 in TableB and add them to the corresponding row in TableA (based on its value in column 7).

Importantly, some values appear multiple times in Column 7 of TableA and these should all receive the same value that was extracted from TableB.

TableA has 10 columns of the format:

OTU_8   dbj|AB021887.1| 3.04e-84    100.000 315 0   AB021887
OTU_142 dbj|AB021887.1| 5.05e-82    99.412  307 0   AB021887
OTU_124 gb|AF156149.1|  4.97e-25    76.106  119 0   AF156149
OTU_145 gb|AF156149.1|  2.28e-33    78.319  147 0   AF156149
OTU_27  gb|AF156151.1|  2.36e-18    84.000  97.1    0   AF156151

TableB has four columns:

AB021887        AB021887.1      7936    12248848
AF156149        AF156149.1      114741  7682414
AF156151        AF156151.1      114754  7682418
AP014556        AP014556.1      62819   1237088233
AP017673        AP017673.1      29170   1089667374
AP017981        AP017981.1      1450757 1148885259
AW360743        AW360743.1      10090   6865393

I have tried the following command:

awk 'NR==FNR{a[$7]=$0; next} ($1 in a) {print a[$1],$3}' TableA TableB > TableC

However, it ignores duplicates in TableA and instead of

OTU_8   dbj|AB021887.1| 3.04e-84    100.000 315 0   AB021887      7936
OTU_142 dbj|AB021887.1| 5.05e-82    99.412  307 0   AB021887      7936
OTU_124 gb|AF156149.1|  4.97e-25    76.106  119 0   AF156149      114741
OTU_145 gb|AF156149.1|  2.28e-33    78.319  147 0   AF156149      114741
OTU_27  gb|AF156151.1|  2.36e-18    84.000  97.1    0   AF156151      114754

I receive only

OTU_8   dbj|AB021887.1| 3.04e-84    100.000 315 0   AB021887      7936
OTU_124 gb|AF156149.1|  4.97e-25    76.106  119 0   AF156149      114741
OTU_27  gb|AF156151.1|  2.36e-18    84.000  97.1    0   AF156151      114754

I would greatly appreciate any help.

tripleee
  • 175,061
  • 34
  • 275
  • 318

2 Answers2

3

a[$7]=$0; next simply remembers the last value for $7 and overwrites any previous value. You can fix this by turning around the relationship (assuming you have no duplicates in the other file, of course).

awk 'NR==FNR{a[$1]=$3; next} ($7 in a) {print $0, a[$7]}' TableB TableA > TableC
tripleee
  • 175,061
  • 34
  • 275
  • 318
2

I would use join for that :

join -1 7 -o 1.1,1.2,1.3,1.4,1.5,1.6,1.7,2.3 <(sort tableA -k7) <(sort tableB -k1)

Don't forget to sort input files, -1 7 option makes the join on the seventh field of tableA, -ooption orders the output columns

Output :

OTU_142 dbj|AB021887.1| 5.05e-82 99.412 307 0 AB021887 7936
OTU_8 dbj|AB021887.1| 3.04e-84 100.000 315 0 AB021887 7936
OTU_124 gb|AF156149.1| 4.97e-25 76.106 119 0 AF156149 114741
OTU_145 gb|AF156149.1| 2.28e-33 78.319 147 0 AF156149 114741
OTU_27 gb|AF156151.1| 2.36e-18 84.000 97.1 0 AF156151 114754
Corentin Limier
  • 4,946
  • 1
  • 13
  • 24