awk match() multiple matches

Question

I have the following:

echo AS:i:0  UQ:i:0  ZZ:Z:mus.sup  NM:i:0  MD:Z:50  ZZ:Z:cas.sup  CO:Z:endOfLine|awk '{match($0,/ZZ:Z[^ ]*/,m); print m[0], m[1]}'

which unfortunately outputs only the first entry (out of two):

ZZ:Z:mus.sup

It looks to me as if the match() function cannot store more than one match in its array. Or am I missing something?

If this is indeed the case, could someone suggest an awk-based matching alternative that returns both ZZ:Z entries? Note that these are NOT located in the same column every time(!), hence the need for the match() function.

The general idea is to obtain, in the same awk command, some values that appear at known column positions (e.g. col1, col2) and some values (fetched based on their unique signature "ZZ:Z") that are located in columns whose index is not known in advance.

In addition, the following attempt using gensub() also fails to print both ZZ:Z entries. It identifies only one of the two (and finds the other one only once the first is removed from the line):

echo AS:i:0  UQ:i:0  ZZ:Z:mus.sup  NM:i:0  MD:Z:50  ZZ:Z:cas.sup  CO:Z:endOfLine|awk '{val= gensub(/.*(ZZ:Z[^ ]*).*/,"\\1 \\2","g",$0);print val}'

The result in this case is:

ZZ:Z:cas.sup

but I'd like the result to be:

ZZ:Z:mus.sup ZZ:Z:cas.sup

Ed Morton · Accepted Answer · 2016-11-13T22:23:26.457

score 4

You were just calling the wrong function; you should be using split(), not match():

$ echo AS:i:0  UQ:i:0  ZZ:Z:mus.sup  NM:i:0  MD:Z:50  ZZ:Z:cas.sup  CO:Z:endOfLine|
awk '{split($0,t,/ZZ:Z[^ ]*/,m); print m[1], m[2]}'
ZZ:Z:mus.sup ZZ:Z:cas.sup

or to print any number of occurrences in the order they appeared in the input:

$ echo AS:i:0  UQ:i:0  ZZ:Z:mus.sup  NM:i:0  MD:Z:50  ZZ:Z:cas.sup  CO:Z:endOfLine|
awk '{split($0,t,/ZZ:Z[^ ]*/,m); for (i=1; i in m; i++) print m[i]}'
ZZ:Z:mus.sup
ZZ:Z:cas.sup

That uses GNU awk for the 4th arg to split() just like you were using GNU awk for the 3rd arg to match().

If you had to do this in a non-GNU awk it'd just be:

$ echo AS:i:0  UQ:i:0  ZZ:Z:mus.sup  NM:i:0  MD:Z:50  ZZ:Z:cas.sup  CO:Z:endOfLine|
awk '{while(match($0,/ZZ:Z[^ ]*/)) {print substr($0,RSTART,RLENGTH); $0=substr($0,RSTART+RLENGTH)}}'
ZZ:Z:mus.sup
ZZ:Z:cas.sup
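
To combine this with the values at known column positions mentioned in the question, note that the loop above overwrites $0, so $1, $2, etc. are no longer the original fields afterwards. One way around that is to scan the fields instead. This is only a sketch, assuming the entries are whitespace-separated fields that begin with ZZ:Z; the function name print_cols_and_zz and the choice of columns 1 and 2 are illustrative and not part of the original answer:

```shell
# Hypothetical helper: print columns 1 and 2, then every field that starts with ZZ:Z.
# Uses only POSIX awk features, so it does not depend on GNU awk.
print_cols_and_zz() {
    awk '{
        out = $1 " " $2                   # values at known column positions
        for (i = 1; i <= NF; i++)         # scan all fields for the signature
            if ($i ~ /^ZZ:Z/)
                out = out " " $i
        print out
    }'
}

echo 'AS:i:0  UQ:i:0  ZZ:Z:mus.sup  NM:i:0  MD:Z:50  ZZ:Z:cas.sup  CO:Z:endOfLine' |
    print_cols_and_zz
```

For the sample line this prints AS:i:0 UQ:i:0 ZZ:Z:mus.sup ZZ:Z:cas.sup. Unlike the regexp /ZZ:Z[^ ]*/, the test $i ~ /^ZZ:Z/ only accepts ZZ:Z at the start of a field.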

edited Nov 13 '16 at 22:23

answered Nov 13 '16 at 14:31


Could you please check your solution, it might be a problem only at my end, but I do get an error message: " awk: fatal: 4 is invalid as number of arguments for split " – Roy Nov 13 '16 at 20:53
You need to use GNU awk 4.0 or more recent. If you're on an older version than that you need to update ASAP as 4.0 has been around for 5+ years (4.0.0 came out June 2011, we're now on version 4.1.4!) and you're missing a ton of extremely useful functionality and bug fixes (see https://www.gnu.org/software/gawk/manual/gawk.html#Feature-History) – Ed Morton Nov 13 '16 at 22:12

score 3 · Answer 2 · answered Nov 13 '16 at 01:51

The results of match can be used to get the unmatched portion for additional matching:

{
        for (s = $0; match(s, /ZZ:Z[^ ]*/);
            s = substr(s, RSTART + RLENGTH, length))
                printf("%s%s", s == $0 ? "" : " ", 
                    substr(s, RSTART, RLENGTH))
        print ""
}

Alternatively, the string can be split on the unique identifier, either with split or FS:

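{
        l = split($0, a, /ZZ:Z/)
        for(i = 2; i <= l; i++) {
                # keep only the text before the first space; unlike
                # index(a[i], " ") - 1, this also works when the entry
                # is the last thing on the line
                sub(/ .*/, "", a[i])
                printf("%s%s", i == 2 ? "" : " ", "ZZ:Z" a[i])
        }
        print ""
}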
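
The FS route mentioned above is not shown, so here is a rough sketch of what it might look like, assuming ZZ:Z is used as the field separator so that every field after the first begins with the rest of an entry. The function name zz_by_fs is made up for illustration:

```shell
# Hypothetical helper: split each line on the literal text ZZ:Z via FS.
zz_by_fs() {
    awk -F'ZZ:Z' '{
        line = ""
        for (i = 2; i <= NF; i++) {       # field 1 is the text before the first ZZ:Z
            v = $i
            sub(/ .*/, "", v)             # drop everything from the first space on
            line = line (i == 2 ? "" : " ") "ZZ:Z" v
        }
        print line
    }'
}

echo 'AS:i:0  UQ:i:0  ZZ:Z:mus.sup  NM:i:0  MD:Z:50  ZZ:Z:cas.sup  CO:Z:endOfLine' |
    zz_by_fs
```

For the sample line this prints ZZ:Z:mus.sup ZZ:Z:cas.sup. With FS set this way the usual $1, $2 columns are no longer available, so this variant fits best when only the ZZ:Z entries are needed.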

This is a neat general solution that works with any number of occurrences. Very nice, thanks! – Roy Nov 13 '16 at 02:41

Roy · Answer 3 · 2016-11-13T03:00:42.033

Thanks, the above solutions are great and solve the problem in general, no matter how many times the ZZ:Z entry repeats in the original line.

However, this is the one-liner I was aiming for, which fixes the wrong matching condition I was using above:

echo AS:i:0  UQ:i:0  ZZ:Z:mus.sup  NM:i:0  MD:Z:50  ZZ:Z:cas.sup  CO:Z:endOfLine|awk '{val= gensub(/.*(ZZ:Z[^ ]*).*(ZZ:Z[^ ]*).*/,"\\1 \\2","g");print val}'

output:

ZZ:Z:mus.sup ZZ:Z:cas.sup

Also, this is the solution using awk's match():

echo AS:i:0  UQ:i:0  ZZ:Z:mus.sup  NM:i:0  MD:Z:50  ZZ:Z:cas.sup  CO:Z:endOfLine|awk '{match($0,/.*(ZZ:Z[^ ]*).*(ZZ:Z[^ ]*).*/,m); print m[1], m[2]}'

No, neither of those is the right solution. See http://stackoverflow.com/a/40574948/1745001. — Ed Morton, Nov 13 '16 at 14:32
