2

I have the following:

echo AS:i:0  UQ:i:0  ZZ:Z:mus.sup  NM:i:0  MD:Z:50  ZZ:Z:cas.sup  CO:Z:endOfLine|awk '{match($0,/ZZ:Z[^ ]*/,m); print m[0], m[1]}' 

which unfortunately outputs only the first entry (out of two):

ZZ:Z:mus.sup 

It looks to me as though the match() function cannot store more than one match in its array. Or am I missing something?

If this is indeed the case, could someone kindly suggest an awk-based matching alternative that returns both ZZ:Z entries? Note that these are NOT always located in the same column(!), hence the need for the match() function.

The general idea is to obtain, in a single awk command, some values that appear at known column positions (e.g. col1, col2) and some values (identified by their unique signature "ZZ:Z") that are located at unknown column positions.

In addition, the following attempt using gensub() also fails to print both ZZ:Z entries. It identifies only one of the two (and finds the other only once the first has been removed from the input):

echo AS:i:0  UQ:i:0  ZZ:Z:mus.sup  NM:i:0  MD:Z:50  ZZ:Z:cas.sup  CO:Z:endOfLine|awk '{val= gensub(/.*(ZZ:Z[^ ]*).*/,"\\1 \\2","g",$0);print val}'

the result in this case is:

ZZ:Z:cas.sup

but I'd like to have as result:

ZZ:Z:mus.sup ZZ:Z:cas.sup 
Roy

3 Answers

4

You were just calling the wrong function; you should be using split(), not match():

$ echo AS:i:0  UQ:i:0  ZZ:Z:mus.sup  NM:i:0  MD:Z:50  ZZ:Z:cas.sup  CO:Z:endOfLine|
awk '{split($0,t,/ZZ:Z[^ ]*/,m); print m[1], m[2]}'
ZZ:Z:mus.sup ZZ:Z:cas.sup

or to print any number of occurrences in the order they appeared in the input:

$ echo AS:i:0  UQ:i:0  ZZ:Z:mus.sup  NM:i:0  MD:Z:50  ZZ:Z:cas.sup  CO:Z:endOfLine|
awk '{split($0,t,/ZZ:Z[^ ]*/,m); for (i=1; i in m; i++) print m[i]}'
ZZ:Z:mus.sup
ZZ:Z:cas.sup

That uses GNU awk for the 4th arg to split() just like you were using GNU awk for the 3rd arg to match().

If you had to do this in a non-GNU awk it'd just be:

$ echo AS:i:0  UQ:i:0  ZZ:Z:mus.sup  NM:i:0  MD:Z:50  ZZ:Z:cas.sup  CO:Z:endOfLine|
awk '{while(match($0,/ZZ:Z[^ ]*/)) {print substr($0,RSTART,RLENGTH); $0=substr($0,RSTART+RLENGTH)}}'
ZZ:Z:mus.sup
ZZ:Z:cas.sup
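
Because the question also asks for values at known columns in the same command, note that the non-GNU loop above overwrites $0, so the original fields are gone once it finishes. A minimal sketch of one way to keep both, assuming the fixed columns of interest are $1 and $2 (borrowed from the question's "col1, col2" example); the function name extract_tags and the variables rest and out are hypothetical:

```shell
# Print columns 1 and 2 followed by every ZZ:Z entry, in input order.
# Uses only POSIX awk features (2-argument match, RSTART, RLENGTH).
extract_tags() {
    awk '{
        out = $1 " " $2          # values at known column positions
        rest = $0                # scan a copy so $0 and its fields survive
        while (match(rest, /ZZ:Z[^ ]*/)) {
            out = out " " substr(rest, RSTART, RLENGTH)
            rest = substr(rest, RSTART + RLENGTH)   # continue after this match
        }
        print out
    }'
}

echo "AS:i:0  UQ:i:0  ZZ:Z:mus.sup  NM:i:0  MD:Z:50  ZZ:Z:cas.sup  CO:Z:endOfLine" | extract_tags
```

For the sample line this should print the two leading fields followed by both ZZ:Z entries on one line.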
Ed Morton
  • Could you please check your solution, it might be a problem only at my end, but I do get an error message: " awk: fatal: 4 is invalid as number of arguments for split " – Roy Nov 13 '16 at 20:53
  • You need to use GNU awk 4.0 or more recent. If you're on an older version than that you need to update ASAP as 4.0 has been around for 5+ years (4.0.0 came out June 2011, we're now on version 4.1.4!) and you're missing a ton of extremely useful functionality and bug fixes (see https://www.gnu.org/software/gawk/manual/gawk.html#Feature-History) – Ed Morton Nov 13 '16 at 22:12
3

The results of match can be used to get the unmatched portion for additional matching:

{
        for (s = $0; match(s, /ZZ:Z[^ ]*/);
            s = substr(s, RSTART + RLENGTH, length))
                printf("%s%s", s == $0 ? "" : " ", 
                    substr(s, RSTART, RLENGTH))
        print ""
}

Alternatively, the string can be split on the unique identifier, either with split or FS:

{
        l = split($0, a, /ZZ:Z/)
        for(i = 2; i <= l; i++)
                printf("%s%s", i == 2 ? "" : " ",
                    "ZZ:Z" substr(a[i], 1, index(a[i], " ") - 1))
        print ""
}
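
One caveat with the split-based version above: if a ZZ:Z entry is the last thing on the line, index(a[i], " ") returns 0 and substr() then yields an empty string, so that entry's value is dropped. A hedged sketch of one possible adjustment, which cuts at the first space only when there is one; the function name split_tags and the variables sp, val and out are illustrative:

```shell
# Collect every ZZ:Z entry by splitting the line on the "ZZ:Z" signature.
split_tags() {
    awk '{
        n = split($0, a, /ZZ:Z/)
        out = ""
        for (i = 2; i <= n; i++) {
            sp = index(a[i], " ")                     # 0 when no space follows
            val = (sp > 0) ? substr(a[i], 1, sp - 1) : a[i]
            out = out (i == 2 ? "" : " ") "ZZ:Z" val
        }
        print out
    }'
}

# The second entry ends the line, with no trailing space.
echo "AS:i:0 ZZ:Z:mus.sup NM:i:0 ZZ:Z:cas.sup" | split_tags
```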
kdhp
  • this is a neat general solution, can work with as many appearances as they come - very nice! tnx – Roy Nov 13 '16 at 02:41
0

Thanks, the above solutions are great and provide a general solution to the problem, no matter how many times the ZZ:Z entry repeats in the original line.

However, this is the one-liner I was aiming for; it fixes the incorrect pattern I was using above:

echo AS:i:0  UQ:i:0  ZZ:Z:mus.sup  NM:i:0  MD:Z:50  ZZ:Z:cas.sup  CO:Z:endOfLine|awk '{val= gensub(/.*(ZZ:Z[^ ]*).*(ZZ:Z[^ ]*).*/,"\\1 \\2","g");print val}'

output:

ZZ:Z:mus.sup ZZ:Z:cas.sup

And this is the equivalent solution using awk's match():

echo AS:i:0  UQ:i:0  ZZ:Z:mus.sup  NM:i:0  MD:Z:50  ZZ:Z:cas.sup  CO:Z:endOfLine|awk '{match($0,/.*(ZZ:Z[^ ]*).*(ZZ:Z[^ ]*).*/,m); print m[1], m[2]}'
Roy