Extract multiple independent regex matches per line

Question

For the file below, I want to extract the two strings following "XC:Z:" and "XM:Z:". For example:

1st line output should be this: "TGGTCGGCGCGT, GAGTCCGT"
2nd line output should be this: "GAAGCCGCTTCC, ACCGACGG"

The original version of the file has a few more columns and millions of rows than the following example, but it should give you the idea:

    MOUSE_10        XC:Z:TGGTCGGCGCGT       RG:Z:A  XM:Z:GAGTCCGT   ZP:i:33
    MOUSE_10        XC:Z:GAAGCCGCTTCC       NM:i:0  XM:Z:ACCGACGG   AS:i:16
    MOUSE_10        ZP:i:36 XC:Z:TCCCCGGGTACA       NM:i:0  XM:Z:GGGACGGG   ZP:i:28
    MOUSE_10        XC:Z:CAAATTTGGAAA       RG:Z:A  NM:i:1  XM:Z:GCAGATAG

In addition, each of the following criteria would be a bonus, but none is mandatory:

Use standard bash tools: awk, sed, grep, etc. (no GAWK, csvtools, ...)
Assume we don't know the order in which XC and XM appear (I'm fairly certain XC almost always comes first, but I am unsure how to check). In the output, however, the XC string should always come before the XM string, if at all possible.

The answers to the question "awk extract multiple groups from each line" come awfully close to it, but whenever I try using match(...) I get a "syntax error near unexpected token" message.

Looking forward to your solutions!

Thanks,

Felix

You should show your code that comes up with the error — we can probably fix that easily. — Jonathan Leffler, Nov 20 '17 at 15:39
Would you ever get 3 lots of `XC:Z:` and 2 lots of `XM:Z:` on a single line? Can you have one pattern without the other? What exactly is the required output — should the prefix be preserved? Are you wanting one line per pattern in the output, even if there are 2 or more matches in a single input line, so the total number of lines in the output could be greater than the number of lines in the input. It isn't hard to do; it is just a question of working out exactly what you want done. Producing an MCVE ([MCVE]) with sample output data too (the input shown is good) helps. — Jonathan Leffler, Nov 20 '17 at 15:41
@JonathanLeffler I expect exactly one occurrence of each string (XC:Z: and XM:Z:) per line, thanks for the clarifying question. I agree the complete output on top of the two supplied examples would get closer to a real MCVE, sorry for not having added it! — Felix, Nov 20 '17 at 16:44
Why not add it now? wrt your syntax error - either you're running old, broken awk or you're calling awk from the command line and bash is interpreting the `!`. Without knowing more about your environment and what you're executing and your expected output, etc. we can't help you much. — Ed Morton, Nov 20 '17 at 17:57

SLePort · Answer 1 · 2017-11-21T06:46:15.227

1

With sed you can capture non-space characters after XC:Z: and XM:Z:

sed -n 's/.*XC:Z:\([^[:blank:]]*\).*XM:Z:\([^[:blank:]]*\).*/\1, \2/p;' file

You can add a second s command for lines where XM:Z: comes first; its replacement swaps the two groups so the XC value is still printed first:

sed -n 's/.*XC:Z:\([^[:blank:]]*\).*XM:Z:\([^[:blank:]]*\).*/\1, \2/;s/.*XM:Z:\([^[:blank:]]*\).*XC:Z:\([^[:blank:]]*\).*/\2, \1/;p;' file
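
A minimal sketch of the two-order idea on two sample lines, assuming tab-separated input and exactly one XC:Z: and one XM:Z: per line. The second sample line is the question's second row with the tags reordered for illustration, and the temporary file and variable names are hypothetical. The `t` between the two substitutions only skips the second attempt once the first has succeeded.

```shell
# Two sample lines (tab-separated): XC first, then XM first.
sample=$(mktemp)
printf 'MOUSE_10\tXC:Z:TGGTCGGCGCGT\tRG:Z:A\tXM:Z:GAGTCCGT\tZP:i:33\n' >  "$sample"
printf 'MOUSE_10\tXM:Z:ACCGACGG\tNM:i:0\tXC:Z:GAAGCCGCTTCC\tAS:i:16\n' >> "$sample"

# First expression handles XC before XM; the last one handles XM before XC
# and swaps the groups so the XC value is always printed first.
out=$(sed -n \
  -e 's/.*XC:Z:\([^[:blank:]]*\).*XM:Z:\([^[:blank:]]*\).*/\1, \2/p' \
  -e 't' \
  -e 's/.*XM:Z:\([^[:blank:]]*\).*XC:Z:\([^[:blank:]]*\).*/\2, \1/p' \
  "$sample")

printf '%s\n' "$out"
rm -f "$sample"
```

Because each substitution carries its own `p` flag under `-n`, lines that contain neither ordering are not printed at all.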

edited Nov 21 '17 at 06:46

answered Nov 20 '17 at 15:43

SLePort


Thanks @SLePort, this does for me what it should. I understand this assumes XC:Z: is always before 'XM:Z:', is that correct? – Felix Nov 20 '17 at 21:58
I edited to cover the case where `XM:Z` is before `XC:Z`. – SLePort Nov 21 '17 at 05:22

score 0 · Answer 2 · answered Nov 20 '17 at 15:41

The following awk solution may help.

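awk '
/XC:Z:/{
  match($0,/XC:[^[:blank:]]*/);                 # locate the XC tag; sets RSTART and RLENGTH
  num=split(substr($0,RSTART,RLENGTH),a,":");   # a holds "XC", "Z" and the value; num is 3
  match($0,/XM:[^[:blank:]]*/);                 # same again for the XM tag
  num1=split(substr($0,RSTART,RLENGTH),b,":");
  print a[num],b[num1]                          # the last piece of each tag is the value
}'   Input_file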

Output will be as follows.

TGGTCGGCGCGT GAGTCCGT
GAAGCCGCTTCC ACCGACGG
TCCCCGGGTACA GGGACGGG
CAAATTTGGAAA GCAGATAG
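
To see what the intermediate variables hold, here is a small sketch that prints them for the first sample line. It assumes the columns are separated by single spaces, and the variable names `line`, `tok` and `out` are only for illustration.

```shell
line='MOUSE_10 XC:Z:TGGTCGGCGCGT RG:Z:A XM:Z:GAGTCCGT ZP:i:33'

out=$(printf '%s\n' "$line" | awk '{
  match($0, /XC:[^ \t]*/)             # sets RSTART (position) and RLENGTH (length)
  tok = substr($0, RSTART, RLENGTH)   # the whole tag, e.g. XC:Z:TGG...
  num = split(tok, a, ":")            # a[1]="XC", a[2]="Z", a[3]=value; num=3
  printf "RSTART=%d RLENGTH=%d token=%s num=%d a[1]=%s a[2]=%s a[3]=%s\n",
         RSTART, RLENGTH, tok, num, a[1], a[2], a[3]
}')

printf '%s\n' "$out"
```

So `a` and `b` are arrays filled by `split`, and `num` and `num1` are the number of pieces, which is why `a[num]` is the last piece, i.e. the sequence itself.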

RavinderSingh13


Thanks RavinderSingh13, I like your answer and would like to understand it a little better. Can you explain which values a,b and num,num1 are taking? I assume a and b are arrays that are being created inside the split command, is that correct? – Felix Nov 20 '17 at 22:21

karakfa · Answer 3 · 2017-11-20T22:00:15.760

another awk

$ awk '{c=p="";                               # need to reset c and p before each line
        for(i=1;i<=NF;i++)                    # for all fields in the line
          if($i~/^XC:Z:/) c=substr($i,6)      # check pattern from the start of field
          else if($i~/^XM:Z:/) p=substr($i,6) # otherwise check the other pattern
        if(c && p) print c,p}' file           # if both matched print

TGGTCGGCGCGT GAGTCCGT
GAAGCCGCTTCC ACCGACGG
TCCCCGGGTACA GGGACGGG
CAAATTTGGAAA GCAGATAG

This prints the last match of each pattern if a pattern occurs more than once on the same line. Here is another one with slightly different behavior.

$ awk 'function s(x) {return ($i~x)?substr($i,6):""}
      {c=p="";
       for(i=1;i<=NF;i++) {
         c=c?c:s("^XC:Z:"); p=p?p:s("^XM:Z:");
         if(c && p) 
           {print c,p; next}}}' file

TGGTCGGCGCGT GAGTCCGT
GAAGCCGCTTCC ACCGACGG
TCCCCGGGTACA GGGACGGG
CAAATTTGGAAA GCAGATAG

This keeps the first match of each pattern and moves on to the next line as soon as both have been found. If they appear in pairs, it prints the first pair.
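
A quick way to compare the two variants is to feed both a line with repeated tags. This is only a sketch: the input line is made up for illustration, and the variable names `last` and `first` are hypothetical.

```shell
# Hypothetical input with each tag repeated twice.
line='MOUSE_10 XC:Z:AAAA XC:Z:CCCC XM:Z:GGGG XM:Z:TTTT'

# Variant 1: later fields overwrite earlier ones, so the last values win.
last=$(printf '%s\n' "$line" | awk '{
  c = p = ""
  for (i = 1; i <= NF; i++)
    if ($i ~ /^XC:Z:/) c = substr($i, 6)
    else if ($i ~ /^XM:Z:/) p = substr($i, 6)
  if (c && p) print c, p
}')

# Variant 2: a value is only assigned while it is still empty,
# so the first values win and the line is abandoned once both are set.
first=$(printf '%s\n' "$line" | awk '
  function s(x) { return ($i ~ x) ? substr($i, 6) : "" }
  {
    c = p = ""
    for (i = 1; i <= NF; i++) {
      c = c ? c : s("^XC:Z:")
      p = p ? p : s("^XM:Z:")
      if (c && p) { print c, p; next }
    }
  }')

printf 'variant 1: %s\nvariant 2: %s\n' "$last" "$first"
```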

Hi @karakfa, really like your solution1, seems to work well on my file, thanks. Three things are unclear to me. First, shouldn't we reset c and p (c=p="") inside the for-loop rather than before it? Second, why do you use 'else if' instead of 'if'? The third point is: does the '^'-sign not usually stand for the start-of-line, if so how does it work here although there's always at least one column ("Mouse_10") before it? (#3 is me being too lazy to google, feel free to ignore if you're unmotivated to explain regex to me - but #1 and #2 would be good to resolve!) — Felix, Nov 20 '17 at 21:52
added explanation. `if else` is better since it can only match one pattern at a time, if it matched the previous pattern, there is no need to check the other one. If this solution solved your question, you'll need the up vote and/or select as the answer. — karakfa, Nov 20 '17 at 22:01
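
On the `^` question: in `$i ~ /^XC:Z:/` the regex is applied to the field `$i`, not to the whole line, so `^` anchors to the start of that field. A small sketch with a made-up line (the `PXC:Z:` field is hypothetical) shows the difference:

```shell
out=$(printf 'MOUSE_10\tPXC:Z:AAAA\tXC:Z:CCCC\n' | awk '{
  for (i = 1; i <= NF; i++)
    if ($i ~ /^XC:Z:/) print i, $i                      # anchored to the start of field i
  if ($0 ~ /^XC:Z:/) print "line starts with XC:Z:"     # anchored to the start of the line
}')

printf '%s\n' "$out"
```

Only the third field is reported: the second field contains `XC:Z:` but does not start with it, and the line as a whole starts with `MOUSE_10`.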

score 0 · Answer 4 · answered Nov 20 '17 at 18:18

If we don't know the order in which XC and XM appear, you can try this sed:

sed -E 'h;s/(XC:Z:.*XM:Z:)//;tA;x;s/(.*XM:Z:)([^[:blank:]]*)(.*XC:Z:)([^[:blank:]]*)(.*)/\4,\2/;b;:A;x;s/(.*XC:Z:)([^[:blank:]]*)(.*XM:Z:)([^[:blank:]]*)(.*)/\2,\4/' infile

explanation :

sed -E '
h
# keep a copy of the line in the hold space
s/(XC:Z:.*XM:Z:)//;x;tA
# test whether XC:Z: comes before XM:Z:; x restores the original line, then tA branches to A if the test succeeded
s/(.*XM:Z:)([^[:blank:]]*)(.*XC:Z:)([^[:blank:]]*)(.*)/\4,\2/
# XM:Z: comes before XC:Z:, so capture the interesting parts and reorder them
b
# done with this line
:A
s/(.*XC:Z:)([^[:blank:]]*)(.*XM:Z:)([^[:blank:]]*)(.*)/\2,\4/
# XC:Z: comes before XM:Z:, so capture the interesting parts in order
' infile

This code does what it should, and the explanations are very useful, thanks! And I'm in awe of your incredible sed skills, very powerful! — Felix, Nov 20 '17 at 21:34

score 0 · Answer 5 · answered Nov 21 '17 at 17:21

Using POSIX awk, you can only use the string-function match(s,ere) as defined by IEEE Std 1003.1-2008 :

match(s, ere)

Return the position, in characters, numbering from 1, in string s where the extended regular expression ere occurs, or zero if it does not occur at all. RSTART shall be set to the starting position (which is the same as the returned value), zero if no match is found; RLENGTH shall be set to the length of the matched string, -1 if no match is found.

The patterns you want to match are XM:Z:[^[:blank:]]* and XC:Z:[^[:blank:]]*. This however assumes you do not have any string which contains something like PXM:Z: (i.e. an extra non-blank character preceding the searched string). When the pattern is found in the line $0, you only need to extract the important part, which starts 5 characters after the beginning of the match.

The following code does the above:

   awk '{match($0,/XM:Z:[^[:blank:]]*/);xm=substr($0,RSTART+5,RLENGTH-5)}
        {match($0,/XC:Z:[^[:blank:]]*/);xc=substr($0,RSTART+5,RLENGTH-5)}
        {print xc","xm}' <file>

As you can see, the first line extracts XM, the second XC and the third prints the outcome with comma-separator ",".

Remark - The following assumptions are made here :

each line contains both an xm and xc string
no strings of the type [^[:blank:]]X[CM]:Z:[^[:blank:]]* exist
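
If the first assumption might not hold, the return value of match() can be used to skip incomplete lines. A minimal sketch, assuming space- or tab-separated input; the second sample line is deliberately missing its XM:Z: tag, and the variable names are only for illustration:

```shell
out=$(printf '%s\n' \
  'MOUSE_10 XC:Z:TGGTCGGCGCGT RG:Z:A XM:Z:GAGTCCGT ZP:i:33' \
  'MOUSE_10 XC:Z:CAAATTTGGAAA NM:i:1' |
awk '{
  xc = xm = ""                                   # reset for every line
  # match() returns 0 when the pattern is absent, so the value stays empty
  if (match($0, /XC:Z:[^ \t]*/)) xc = substr($0, RSTART + 5, RLENGTH - 5)
  if (match($0, /XM:Z:[^ \t]*/)) xm = substr($0, RSTART + 5, RLENGTH - 5)
  if (xc != "" && xm != "") print xc "," xm      # print only complete pairs
}')

printf '%s\n' "$out"
```

Here only the first line produces output; the second is silently dropped because it has no XM:Z: tag.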

If you are willing to use gawk, then you could use the patsplit function (see the gawk manual). You can do this with a single regex /X[CM]:Z:[^[:blank:]]*/. This gives you the requested strings directly in a single call, including the XM:Z: or XC:Z: prefix. Afterwards you can easily sort them and strip the prefixes.

The following lines do exactly the same in gawk

   gawk '{patsplit($0,a,/X[MC]:Z:[^[:blank:]]*/) }
         {xc=(a[1]~/^XC/)?a[1]:a[2]; xm=(a[1]~/^XC/)?a[2]:a[1]}
         {print substr(xc,6)","substr(xm,6)}' <file>

Nonetheless, I believe the POSIX awk solution is cleaner, because it treats both tags symmetrically.
