Substitute patterns using a correspondence file

Question

I am trying to replace some words in a file with others using sed or awk.

My initial fileA has this format:

>Genus_species|NP_001006347.1|transcript-60_2900.p1:1-843

I have a second fileB with the correspondences like this:

NP_001006347.1 GeneA
XP_003643123.1 GeneB

I am trying to substitute the name in fileA to get this output:

>Genus_species|GeneA|transcript-60_2900.p1:1-843

I was thinking of using awk or sed to do something like 's/$patternA/$patternB/' inside a while read loop, but how do I indicate which columns of fileB are pattern 1 and pattern 2? I also tried this, but it does not work:

sed "$(sed 's/^\([^ ]*\) \(.*\)$/s#\1#\2#g/' fileB)" fileA

Would awk do the job more easily?

Thanks

score 4 · Answer 1 · answered Dec 08 '20 at 19:54

It is easier to do this in awk:

awk -v OFS='|' 'NR == FNR {
   map[$1] = $2
   next
}
{
   for (i=1; i<=NF; ++i)
      $i in map && $i = map[$i]
} 1' file2 FS='|' file1

>Genus_species|GeneA|transcript-60_2900.p1:1-843

anubhava

score 2 · Answer 2 · answered Dec 08 '20 at 20:24

Written and tested with your shown samples. Assuming each line of Input_fileA contains only one NP_digits.digits entry, you could also try the following.

awk '
FNR==NR{
  arr[$1]=$2
  next
}
match($0,/NP_[0-9]+\.[0-9]+/) && ((val=substr($0,RSTART,RLENGTH)) in arr){
  $0=substr($0,1,RSTART-1) arr[val] substr($0,RSTART+RLENGTH)
}
1
'  Input_fileB  Input_fileA
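The regex above only matches accessions that start with NP_, while fileB in the question also lists an XP_ accession. A minimal sketch of a broader variant follows, assuming every accession looks like two uppercase letters, an underscore, then digits.digits. The second header line in the sample data is made up for illustration; it is not from the question.

```shell
#!/bin/sh
# Build small sample inputs in a temporary directory (illustration only).
tmp=$(mktemp -d)
# First header is from the question; the second is a hypothetical XP_ example.
printf '%s\n' \
  '>Genus_species|NP_001006347.1|transcript-60_2900.p1:1-843' \
  '>Genus_species|XP_003643123.1|transcript-61_100.p1:1-300' > "$tmp/fileA"
printf '%s\n' 'NP_001006347.1 GeneA' 'XP_003643123.1 GeneB' > "$tmp/fileB"

result=$(awk '
# First file (fileB): remember accession -> gene name.
FNR==NR { arr[$1]=$2; next }
# Second file (fileA): find the first accession-like token (assumed format:
# two uppercase letters, "_", digits "." digits) and swap it if it is mapped.
match($0, /[A-Z][A-Z]_[0-9]+\.[0-9]+/) && ((val=substr($0,RSTART,RLENGTH)) in arr) {
  $0 = substr($0,1,RSTART-1) arr[val] substr($0,RSTART+RLENGTH)
}
# Print every line, changed or not.
1
' "$tmp/fileB" "$tmp/fileA")

printf '%s\n' "$result"
rm -rf "$tmp"
```

As in the original, only the first accession on each line is replaced, and lines without a mapped accession are printed unchanged.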

score 1 · Accepted Answer · answered Dec 08 '20 at 19:54

Using awk

awk -F'[| ]' -v OFS='|' 'NR==FNR { arr[$1]=$2; next } $2 in arr { $2=arr[$2] } 1' fileB fileA

Set the input field separator to either a space or |, and the output field separator to |. While reading fileB (NR==FNR), build an array called arr with the first field as the index and the second field as the value. Then, for each line of fileA, check whether the second field has an entry in arr; if it does, replace the second field with the value from the array. The trailing 1 is shorthand for printing every line.

This will break if `|` is found anywhere in file2 i.e. `NP|001006347.1 GeneA` — anubhava, Dec 08 '20 at 20:11

shilch · Answer 4 · 2020-12-08T20:03:09.530

You are looking for the join command which can be used like this:

join -1 1 -2 2 -t'|' <(tr ' ' '|' < fileB | sort -t'|' -k1,1) <(sort -t'|' -k2,2 fileA)

This performs a join on column 1 of fileB with column 2 of fileA. The tr call converts the spaces in fileB to | because join requires the same delimiter in both files.

Note that the output columns are not in the order you specified. You can swap by piping the output into awk.
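The column swap mentioned above could be sketched as follows. This is an illustration under assumptions, not part of the original answer: it uses temporary files instead of process substitution so it also runs under plain sh, sets LC_ALL=C so that sort and join agree on ordering, and assumes every fileA line has exactly three |-separated fields.

```shell
#!/bin/sh
# Use one collation order for both sort and join.
export LC_ALL=C

# Sample inputs copied from the question, written to a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' '>Genus_species|NP_001006347.1|transcript-60_2900.p1:1-843' > "$tmp/fileA"
printf '%s\n' 'NP_001006347.1 GeneA' 'XP_003643123.1 GeneB' > "$tmp/fileB"

# Give fileB the same delimiter as fileA, then sort each file on its join field.
tr ' ' '|' < "$tmp/fileB" | sort -t'|' -k1,1 > "$tmp/b.sorted"
sort -t'|' -k2,2 "$tmp/fileA" > "$tmp/a.sorted"

# join prints: accession|gene|species|transcript.
# awk then reorders the fields to: species|gene|transcript.
result=$(join -1 1 -2 2 -t'|' "$tmp/b.sorted" "$tmp/a.sorted" |
  awk -F'|' -v OFS='|' '{ print $3, $2, $4 }')

printf '%s\n' "$result"
rm -rf "$tmp"
```

Two caveats of this approach: by default join drops fileA lines that have no match in fileB, and the output follows the sorted order rather than the original order of fileA.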
