
I have TABLE1, whose first column is a string that should be replaced wherever it occurs in TABLE2, and whose second column is the value that should replace it.

TABLE1 looks like this:

g63. MYL9
g5990. PTC7
g6018. POLYUBQ
g17850. NAA50

TABLE2 looks, for example, like this:

PIZI01000001v1 AUGUSTUS gene 751753 768572 0.06 - . g63.
PIZI01000001v1  AUGUSTUS    intron  751969  752021  1   -   .   transcript_id "g63.t1"; gene_id "g63.
PIZI01000001v1 AUGUSTUS gene 16680331 16688019 0.25 + . g630.
PIZI01000001v1  AUGUSTUS    intron  16680415    16683083    0.35    +   .   transcript_id "g630.t1"; gene_id "g630.
PIZI01000001v1 AUGUSTUS gene 16695081 16703546 0.93 + . g631.
PIZI01000001v1 AUGUSTUS gene 16730752 16735366 0.65 + . g632.
PIZI01000008v1 AUGUSTUS gene 1943857 1944177 0.71 - . g6299.

So I assembled this awk command:

awk 'FNR==NR { array[$1]==$2; next } { for (i in array) gsub(i, array[i]) }1' TABLE1 TABLE

which works only up to a point: the value MYL9, for example, replaces not only the string g63. but also strings like g630, g631, g632 ... g6300 and so on. So the final table would look like this:

PIZI01000001v1 AUGUSTUS gene 751753 768572 0.06 - . MYL9
PIZI01000001v1  AUGUSTUS    intron  751969  752021  1   -   .   transcript_id "MYL9"; gene_id "MYL9
PIZI01000001v1 AUGUSTUS gene 16680331 16688019 0.25 + . MYL9
PIZI01000001v1  AUGUSTUS    intron  16680415    16683083    0.35    +   .   transcript_id "MYL9t1"; gene_id "MYL9
PIZI01000001v1 AUGUSTUS gene 16695081 16703546 0.93 + . MYL9
PIZI01000001v1 AUGUSTUS gene 16730752 16735366 0.65 + . MYL9
PIZI01000008v1 AUGUSTUS gene 1943857 1944177 0.71 - . g6299.

I need it to replace just g63. and not other strings like g630. and so on.

I have spent quite a long time on this and now have to take a break, so if anybody has an idea what is wrong here, I would appreciate it. Thanks

  • 2
    the current data sets do not have any matches (ie, none of the strings from `TABLE1` exist in `TABLE2`); please update the data sets to ensure there are some matches; also update the question to show the expected results; please also make sure the sample data sets also demonstrate the issue you've mentioned in the last paragraph (eg, includes some input lines with `g63` and `g630`) – markp-fuso Feb 02 '23 at 15:40
  • 1
    Why `==` and not `=`? Note also you should probably be doing string comparisons (`gsub` uses regex) – jhnc Feb 02 '23 at 16:12
  • if TABLE2 is large, for a slight efficiency gain, you can break out of the for loop once gsub or its string equivalent has succeeded – jhnc Feb 02 '23 at 16:16
  • see: https://stackoverflow.com/q/37039053/10971581 – jhnc Feb 02 '23 at 16:21
  • So your example should not have any replacements from Table 1 Row 1 because no `g63` appears in the data provided? – David C. Rankin Feb 03 '23 at 07:37
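
The two points raised in the comments (assign with `=` rather than compare with `==`, and match literal strings rather than regexes) can be sketched roughly as follows. This is only an illustration, not a tested fix for the real data: the function name `replace_ids` is made up here, and it assumes that no replacement value contains another ID from TABLE1, and that TABLE1 is not empty.

```shell
# Hypothetical helper: replace_ids TABLE1 TABLE2
# Replaces every literal occurrence of a TABLE1 ID in TABLE2.
replace_ids() {
    awk '
    # First file: store ID -> name (note "=", an assignment, not "==").
    FNR == NR { map[$1] = $2; next }
    # Second file: literal replacement with index()/substr(), no regex.
    {
        for (id in map) {
            out = ""; rest = $0
            while ((pos = index(rest, id)) > 0) {
                out  = out substr(rest, 1, pos - 1) map[id]
                rest = substr(rest, pos + length(id))
            }
            $0 = out rest
        }
        print
    }' "$1" "$2"
}
```

Called as `replace_ids TABLE1 TABLE2`, this should leave `g630.` alone, because `index()` looks for the exact characters `g63.` and the dot is not a wildcard there.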

1 Answer


So I solved the problem in an inelegant way. I realized that the trailing dot in strings such as g63. is treated as a special regex character (it matches any single character), so I just replaced the dots with underscores.
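
If changing the data is undesirable, an alternative might be to keep the dots and make them literal in the pattern instead. The sketch below turns each `.` in the ID into the bracket expression `[.]` before handing it to `gsub`. The function name `replace_ids_regex` is hypothetical, and the sketch assumes the IDs contain no regex metacharacters other than dots and that the replacement values contain no `&` or backslash.

```shell
# Hypothetical helper: replace_ids_regex TABLE1 TABLE2
replace_ids_regex() {
    awk '
    # First file: build a pattern in which every dot is literal.
    FNR == NR {
        pat = $1
        gsub(/[.]/, "[.]", pat)   # "g63." becomes the regex "g63[.]"
        map[pat] = $2
        next
    }
    # Second file: apply each pattern, then print the line.
    {
        for (pat in map) gsub(pat, map[pat])
        print
    }' "$1" "$2"
}
```

With this, `g63[.]` can only match `g63` followed by a real dot, so `g630.` and `g631.` are left untouched.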