
I have the following pattern:

$ echo -e "1>1>659,659>659>660\n1>1>683,683>683>684\n1>1>712,712>712>713\n1>1>1080648,1>1>1080660\n1>1>1081100,1>1>1081114"
1>1>659,659>659>660
1>1>683,683>683>684
1>1>712,712>712>713
1>1>1080648,1>1>1080660
1>1>1081100,1>1>1081114

I want to replace the places where the same number appears three times in a row, separated first by a comma and then by a greater-than sign (>). To find them with grep I would do:

$ echo -e "1>1>659,659>659>660\n1>1>683,683>683>684\n1>1>712,712>712>713\n1>1>1080648,1>1>1080660\n1>1>1081100,1>1>1081114" |
grep -Eo "([0-9]+),\1>\1"

659,659>659
683,683>683
712,712>712

That is two back-references to the same group.

I know that using gensub() in awk I can have back-references in the replacement field. But how could I have that in the regexp field? Something like this:

result = gensub(/([0-9]+),\\1>\\1/,"my replaced string", "g", string)

How can I achieve that?

anubhava
Adriano_Pinaffo

2 Answers

Here is a sed solution that does the trick.

sed 's|\([0-9]\+\),\1>\1|Replaced string|g'

echo -e "1>1>659,659>659>660\n1>1>683,683>683>684\n1>1>712,712>712>713\n1>1>1080648,1>1>1080660\n1>1>1081100,1>1>1081114" | sed 's|\([0-9]\+\),\1>\1|Replaced string|g'
1>1>Replaced string>660
1>1>Replaced string>684
1>1>Replaced string>713
1>1>1080648,1>1>1080660
1>1>1081100,1>1>1081114

I hope you can live with sed instead of awk.
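One portability note on the sed command: `\+` in a basic regular expression is a GNU extension, so that exact command may not behave the same in other sed implementations. A minimal sketch of the same substitution in plain POSIX BRE, where `[0-9][0-9]*` stands in for `[0-9]\+`, is below. The function name `replace_posix` is only a hypothetical wrapper for illustration.

```shell
# replace_posix is a hypothetical helper name; the s command is the part that matters.
replace_posix() {
    # \(...\) captures a run of digits; \1 must then repeat it after "," and after ">".
    sed 's/\([0-9][0-9]*\),\1>\1/Replaced string/g'
}

printf '%s\n' '1>1>659,659>659>660' '1>1>1080648,1>1>1080660' | replace_posix
```

On these two lines it should print `1>1>Replaced string>660` and leave the second line unchanged.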

But if awk is mandatory, here is an awkward awk script for this.

awk -F "[>,]" '{sub($3","$3">"$3,"Replaced string")}1'

echo -e "1>1>659,659>659>660\n1>1>683,683>683>684\n1>1>712,712>712>713\n1>1>1080648,1>1>1080660\n1>1>1081100,1>1>1081114" | awk -F "[>,]" '{sub($3","$3">"$3,"Replaced string")}1'
1>1>Replaced string>660
1>1>Replaced string>684
1>1>Replaced string>713
1>1>1080648,1>1>1080660
1>1>1081100,1>1>1081114

If you want to validate that the 3rd field is always numeric, add the following condition:

awk -F "[>,]" '$3 ~ "^[0-9]+$"{sub($3","$3">"$3,"Replaced string")}1'
Dudi Boy
  • Good to know that backreferences work with sed, but I need it in awk. Ed Morton's answer did it, but I gave an upvote because this is probably useful. – Adriano_Pinaffo Nov 19 '20 at 00:14

Awk does not support backreferences in a regexp: supporting them would require a much slower regexp engine than the one awk uses (see https://swtch.com/~rsc/regexp/regexp1.html), and they are neither necessary nor often desired. This may be what you're trying to do, using GNU awk for the 3rd arg to match():

$ awk 'match($0,/([0-9]+),/,a){ sub(a[1]","a[1]">"a[1],"my replaced string") } 1' file
1>1>my replaced string>660
1>1>my replaced string>684
1>1>my replaced string>713
1>1>1080648,1>1>1080660
1>1>1081100,1>1>1081114

or with any awk:

$ awk 'match($0,/([0-9]+),/){ a=substr($0,RSTART,RLENGTH-1); sub(a","a">"a,"my replaced string") } 1' file
1>1>my replaced string>660
1>1>my replaced string>684
1>1>my replaced string>713
1>1>1080648,1>1>1080660
1>1>1081100,1>1>1081114
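Both commands above take the number from the first `digits,` on the line and call sub() once, which is enough for the sample input. If the repeated number might not follow the first comma, or might occur more than once per line (the question's gensub() attempt used "g"), one possible approach in any POSIX awk is to loop with match() over every `N,N>N`-shaped candidate and compare the three numbers by hand. This is only a sketch under those assumptions: the function name `replace_repeats` is hypothetical, and it compares whole numbers, so unlike a real backreference regexp it will not match the `659` tail inside `1659,659>659`.

```shell
# replace_repeats is a hypothetical helper name, not something from the answers.
replace_repeats() {
    awk '
    {
        out = ""; rest = $0
        # Find the next "N,N>N"-shaped run; the regexp cannot check equality itself.
        while (match(rest, /[0-9]+,[0-9]+>[0-9]+/)) {
            split(substr(rest, RSTART, RLENGTH), p, /[,>]/)
            # Appending "" forces string comparison, so 07 and 7 stay different.
            if ((p[1] "") == (p[2] "") && (p[2] "") == (p[3] "")) {
                out = out substr(rest, 1, RSTART - 1) "my replaced string"
                rest = substr(rest, RSTART + RLENGTH)
            } else {
                # Keep the first number and its comma, then retry from the second number.
                out = out substr(rest, 1, RSTART - 1) p[1] ","
                rest = substr(rest, RSTART + length(p[1]) + 1)
            }
        }
        print out rest
    }'
}

printf '%s\n' '1>1>659,659>659>660' '1>1>1080648,1>1>1080660' '3,3>3;4,4>4' | replace_repeats
```

On a failed comparison the loop steps past only the first number, so a line such as `5,7>7,7>7>9` still has its `7,7>7` replaced.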
Ed Morton
  • That is perfect. I was so focused on gensub() that I totally forgot about match(). I was about to say that your gawk approach works with awk as well, but I had not realized that when you have gawk installed, "awk" is just a symlink to gawk. But it works, thanks. – Adriano_Pinaffo Nov 19 '20 at 00:11