backreferencing in awk gensub with conditional branching

Question

I'm referring to the answer to: GNU awk: accessing captured groups in replacement text, but with the ? quantifier for regex matching.

I would like to use an if statement, the ternary operator ?:, or something more elegant, so that if the regex group backreferenced with \\1 matches a nonempty string, one arbitrary string is inserted (which may itself contain \\1), and if it matches an empty string, some other arbitrary string is inserted. My example works when the capturing group matches a nonempty string, but it doesn't take the expected branch "B" when the backreference is empty. How can I do conditional branching based on backreferenced values?

echo abba | awk '{ print gensub(/a(b*)?a/, "\\1"?"A":"B", "g", $0)}'

It's not going to work since `"\\1"` is a non-blank string, so it always evaluates to true. – karakfa, Dec 19 '21 at 22:23
[edit] your question to show a minimal reproducible example with concise, testable sample input and expected output that demonstrates your problem so we can help you. Assuming you're trying to convert `XabbaXaaXabaX` to `XAXBXAX`, for example, make sure to include interesting cases like `abbaabba` and `aabbaabba` and `abbaaabba` in your sample input/output so we can see how you expect overlapping matches of the regexp to be handled. See [my answer](https://stackoverflow.com/a/70416020/1745001) for a possible sample input/output you could adapt to whatever your requirements are. – Ed Morton, Dec 19 '21 at 23:15

score 1 · Answer 1 · answered Dec 19 '21 at 22:26

You can assign the result of gensub() to a variable and use that value in the ternary operator afterwards, something like this:

... | awk '{ v=gensub(/a(b*)?a/, "\\1", "g", $0); print v?"A":"B"}'

karakfa

I think `a(b*)?a` is the same as `a(b*)a` but in either case that would print just `A` or `B`, not replace every string in the record that matches the regexp with `A` or `B` while retaining the surrounding context as the OP seems to be trying to do with that `gensub()`. – Ed Morton Dec 19 '21 at 22:55

score 0 · Answer 2 · answered Dec 19 '21 at 22:25

Something like this, maybe?:

$ gawk '{ print gensub(/a(.*)a/, (match($0,/a(b*)?a/)?"A":"B"), "g", $0)}' <<< abba
A

$ gawk '{ print gensub(/a(.*)a/, (match($0,/a(b*)?a/)?"A":"B"), "g", $0)}' <<< acca
B

James Brown

That regexp is too greedy, consider input like `XabbaXaaXabaX`, and even if it weren't it'd replace all `ab*a`s with the result of matching the first one, not treat each individually. – Ed Morton Dec 19 '21 at 22:53
Yeah, it's not of any practical use outside of showing AN implementation of what I thought OP was asking (and to keep me awake while watching this movie...). – James Brown Dec 19 '21 at 23:00
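The greediness point in the comment above can be seen with a minimal sketch. It uses plain sub() with `a.*a` rather than the gensub() call from the answer, and the function name greedy_sub is hypothetical, chosen only for illustration:

```shell
# Hypothetical helper: replace the first match of the greedy regexp a.*a with "A".
# The match runs from the first "a" in the record to the last one,
# so everything in between is swallowed by a single replacement.
greedy_sub() {
    printf '%s\n' "$1" | awk '{ sub(/a.*a/, "A"); print }'
}

greedy_sub XabbaXaaXabaX    # prints: XAX
```

All three `ab*a` substrings disappear into one match, which is why the regexp cannot treat each of them individually.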

Ed Morton · Answer 3 · 2021-12-19T23:38:20.453

The expressions in any arguments you pass to any function are evaluated before the function is called, so `gensub(/a(b*)?a/, "\\1"?"A":"B", "g", $0)` is the same as `str=("\\1"?"A":"B"); gensub(/a(b*)?a/, str, "g", $0)`, which is the same as `gensub(/a(b*)?a/, "A", "g", $0)`.
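A quick way to check this, as a minimal sketch, is to evaluate the ternary on its own, outside any gensub() call. The function name ternary_result is hypothetical and used only for illustration; the one-liner needs no GNU extensions:

```shell
# Hypothetical helper: evaluates the ternary by itself, with no replacement function involved.
# "\\1" is the two-character string constant \1, which is non-empty and therefore true.
ternary_result() {
    awk 'BEGIN { print ("\\1" ? "A" : "B") }'
}

ternary_result    # prints: A
```

Since the result is always "A", that is the only replacement string gensub() ever receives.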

So you cannot do what you're apparently trying to do with a single call to any function. Nor can you call gsub() twice, once with ab+a and then again with aa or similar, without breaking the left-to-right, leftmost-longest order in which such a replacement function, if it existed, would match the regexp against the input.
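To illustrate the ordering problem, here is a sketch of the two-pass attempt applied to `aabbaabba`. The function name two_pass is hypothetical. Matching ab*a strictly left to right would consume the leading aa first, whereas the ab+a pass grabs both abba substrings up front and leaves nothing for the aa pass:

```shell
# Hypothetical two-pass attempt: replace every ab+a first, then every aa.
# The first pass turns a|abba|abba into aAA, so the second pass finds no "aa" at all.
two_pass() {
    printf '%s\n' "$1" | awk '{ gsub(/ab+a/, "A"); gsub(/aa/, "B"); print }'
}

two_pass aabbaabba    # prints: aAA
```

A single left-to-right pass over ab*a would instead produce `BbbBbba` for this input, so the two approaches are not interchangeable.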

It looks like you might be trying to do the following, using GNU awk for patsplit():

awk '
    n = patsplit($0,f,/ab*a/,s) {
        $0 = s[0]
        for ( i=1; i<=n; i++ ) {
            $0 = $0 (f[i] ~ /ab+a/ ? "A" : "B") s[i]
        }
    }
1'

or with any awk:

awk '
    {
        head = ""
        while ( match($0,/ab*a/) ) {
            str = substr($0,RSTART,RLENGTH)
            head = head substr($0,1,RSTART-1) (str ~ /ab+a/ ? "A" : "B")
            $0 = substr($0,RSTART+RLENGTH)
        }
        $0 = head $0
    }
1'

but without sample input/output it's a guess. FWIW given this sample input file:

$ cat file
XabbaXaaXabaX
foo
abbaabba
aabbaabba
bar
abbaaabba

the above will output:

XAXBXAX
foo
AA
BbbBbba
bar
ABbba
