AWK: how to have backreference \1 in gensub() function's regex field?

Question

I have the following pattern:

$ echo -e "1>1>659,659>659>660\n1>1>683,683>683>684\n1>1>712,712>712>713\n1>1>1080648,1>1>1080660\n1>1>1081100,1>1>1081114"
1>1>659,659>659>660
1>1>683,683>683>684
1>1>712,712>712>713
1>1>1080648,1>1>1080660
1>1>1081100,1>1>1081114

I want to replace patterns where the same number appears three times in a row, separated first by a comma and then by a greater-than sign (>). To identify them with grep, I would do:

$ echo -e "1>1>659,659>659>660\n1>1>683,683>683>684\n1>1>712,712>712>713\n1>1>1080648,1>1>1080660\n1>1>1081100,1>1>1081114" |
grep -Eo "([0-9]+),\1>\1"
659,659>659
683,683>683
712,712>712

That is two back-references to the same group.

I know that using gensub() in awk I can have back-references in the replacement field. But how could I have that in the regexp field? Something like this:

result = gensub(/([0-9]+),\\1>\\1/,"my replaced string", "g", string)

How can I achieve that?

Assuming what you're doing is legal then it seems like it should be just `\1` not `\\1` — MonkeyZeus, Nov 18 '20 at 20:56
I am not quite sure whether you do accept `perl` but you could have `echo your_string | perl -pe "s/([0-9]+),\1>\1//"` — Onyambu, Nov 18 '20 at 22:07
I need this awk command to be part of a bigger program in awk, but Ed Morton's answer below did it (without gensub) — Adriano_Pinaffo, Nov 19 '20 at 00:19

Dudi Boy · Answer 1 · 2020-11-18T22:48:47.810

Here is a sed solution that does the trick.

sed 's|\([0-9]\+\),\1>\1|Replaced string|g'

echo -e "1>1>659,659>659>660\n1>1>683,683>683>684\n1>1>712,712>712>713\n1>1>1080648,1>1>1080660\n1>1>1081100,1>1>1081114" | sed 's|\([0-9]\+\),\1>\1|Replaced string|g'
1>1>Replaced string>660
1>1>Replaced string>684
1>1>Replaced string>713
1>1>1080648,1>1>1080660
1>1>1081100,1>1>1081114
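
Note that `\+` is a GNU extension to basic regular expressions, so the command above may not behave the same in other sed implementations. A minimal sketch of the same substitution using the POSIX interval `\{1,\}` instead (the function name `replace_triples` is only an illustrative wrapper, not something from the answer):

```shell
# replace_triples is a hypothetical wrapper name used for illustration.
# \{1,\} is the POSIX BRE spelling of "one or more", replacing the GNU-only \+.
replace_triples() {
    sed 's|\([0-9]\{1,\}\),\1>\1|Replaced string|g'
}

printf '%s\n' '1>1>659,659>659>660' '1>1>1080648,1>1>1080660' | replace_triples
# 1>1>Replaced string>660
# 1>1>1080648,1>1>1080660
```

The backreferences `\1` themselves are part of POSIX basic regular expressions, so only the repetition operator needed changing.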

Hope you can live with sed instead of awk.

But if awk is mandatory, here is an awkward awk script for this.

awk -F "[>,]" '{sub($3","$3">"$3,"Replaced string")}1'

echo -e "1>1>659,659>659>660\n1>1>683,683>683>684\n1>1>712,712>712>713\n1>1>1080648,1>1>1080660\n1>1>1081100,1>1>1081114" | awk -F "[>,]" '{sub($3","$3">"$3,"Replaced string")}1'
1>1>Replaced string>660
1>1>Replaced string>684
1>1>Replaced string>713
1>1>1080648,1>1>1080660
1>1>1081100,1>1>1081114

If you want to validate that the 3rd field is always numeric, add the following condition:

awk -F "[>,]" '$3 ~ "^[0-9]+$"{sub($3","$3">"$3,"Replaced string")}1'
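
The awk commands above depend on how `-F "[>,]"` splits each line, which is why `$3` is the number just before the first comma. A small sketch that prints the fields of one sample line (the function name `show_fields` is a hypothetical helper for illustration only):

```shell
# show_fields is a hypothetical helper: it prints every field with its index,
# using the same field separator as the answer (either ">" or ",").
show_fields() {
    awk -F "[>,]" '{ for (i = 1; i <= NF; i++) print i ": " $i }'
}

echo "1>1>659,659>659>660" | show_fields
# 1: 1
# 2: 1
# 3: 659
# 4: 659
# 5: 659
# 6: 660
```

This also shows the limitation: the approach assumes every line has exactly this shape, with the candidate number always in the third field.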

Good to know that backreferences work with sed, but I need it in awk. Ed Morton answer did it.... but I gave an upvote because this is probably useful. — Adriano_Pinaffo, Nov 19 '20 at 00:14

Ed Morton · Accepted Answer · 2020-11-18T23:00:20.197

Awk does not support backreferences in a regexp: doing so would require a much slower regexp engine than the one awk uses (see https://swtch.com/~rsc/regexp/regexp1.html), and backreferences are not necessary and rarely desired. This may be what you're trying to do, using GNU awk for the 3rd arg to match():

$ awk 'match($0,/([0-9]+),/,a){ sub(a[1]","a[1]">"a[1],"my replaced string") } 1' file
1>1>my replaced string>660
1>1>my replaced string>684
1>1>my replaced string>713
1>1>1080648,1>1>1080660
1>1>1081100,1>1>1081114

or with any awk:

$ awk 'match($0,/([0-9]+),/){ a=substr($0,RSTART,RLENGTH-1); sub(a","a">"a,"my replaced string") } 1' file
1>1>my replaced string>660
1>1>my replaced string>684
1>1>my replaced string>713
1>1>1080648,1>1>1080660
1>1>1081100,1>1>1081114
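
Both commands above test only the first number followed by a comma and replace at most one occurrence per line. If a line can contain several such triples (the question used the "g" flag), one possible sketch for any POSIX awk is a loop that compares literal strings instead of building a dynamic regexp. The function name `replace_all_triples` and the first sample line are assumptions made for illustration:

```shell
# replace_all_triples is a hypothetical helper name, not part of awk.
replace_all_triples() {
    awk '{
        out = ""; rest = $0
        # Find each "digits," in turn and test whether "digits>digits" follows.
        while (match(rest, /[0-9]+,/)) {
            n    = substr(rest, RSTART, RLENGTH - 1)   # number before the comma
            tail = substr(rest, RSTART + RLENGTH)      # text after the comma
            want = n ">" n
            if (substr(tail, 1, length(want)) == want) {
                # Literal string comparison stands in for the two backreferences.
                out  = out substr(rest, 1, RSTART - 1) "my replaced string"
                rest = substr(tail, length(want) + 1)
            } else {
                # No repeat here: keep the text up to and including the comma.
                out  = out substr(rest, 1, RSTART + RLENGTH - 1)
                rest = tail
            }
        }
        print out rest
    }'
}

printf '%s\n' '1>1>659,659>659>660,660>660>661' '1>1>1080648,1>1>1080660' |
replace_all_triples
# 1>1>my replaced string>my replaced string>661
# 1>1>1080648,1>1>1080660
```

Unlike the grep pattern in the question, this sketch always takes the whole run of digits before each comma as the number, so an input such as `1659,659>659` would not be replaced.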

that is perfect. I was so focused on gensub() that I totally forgot about match(). I was about to say that your gawk approach works with awk as well but I had not realized that when you have gawk installed "awk" is just a symlink to gawk. But it works, thanks — Adriano_Pinaffo, Nov 19 '20 at 00:11
