For the file below, I want to extract the two strings following "XC:Z:" and "XM:Z:". For example:
- 1st line output should be this: "TGGTCGGCGCGT, GAGTCCGT"
- 2nd line output should be this: "GAAGCCGCTTCC, ACCGACGG"
The original file has a few more columns and millions of rows, but the following example should give you the idea:
MOUSE_10 XC:Z:TGGTCGGCGCGT RG:Z:A XM:Z:GAGTCCGT ZP:i:33
MOUSE_10 XC:Z:GAAGCCGCTTCC NM:i:0 XM:Z:ACCGACGG AS:i:16
MOUSE_10 ZP:i:36 XC:Z:TCCCCGGGTACA NM:i:0 XM:Z:GGGACGGG ZP:i:28
MOUSE_10 XC:Z:CAAATTTGGAAA RG:Z:A NM:i:1 XM:Z:GCAGATAG
In addition, each of the following criteria would be a bonus, but none of them is mandatory:
- use standard bash tools: awk, sed, grep, etc. (no GAWK, csvtools,...)
- assume we don't know the order in which XC and XM appear (I'm fairly certain XC almost always comes first, but I am unsure how to check). In the output, however, the XC string should always come before the XM string, if at all possible.
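To make the two bonus criteria concrete, here is a minimal sketch of a field-by-field scan in plain POSIX awk. It assumes the columns are separated by spaces or tabs and that each tag appears at most once per line; the function name `extract_tags` is hypothetical and only there for illustration.

```shell
# Hypothetical helper: reads lines on stdin, prints "XC-string, XM-string".
extract_tags() {
  awk '{
    xc = ""; xm = ""                      # reset for every input line
    for (i = 1; i <= NF; i++) {           # visit every whitespace-separated field
      if (index($i, "XC:Z:") == 1)        # field starts with the XC tag
        xc = substr($i, 6)                # keep everything after the 5-character tag
      else if (index($i, "XM:Z:") == 1)   # field starts with the XM tag
        xm = substr($i, 6)
    }
    print xc ", " xm                      # XC always first, whatever the input order
  }'
}

# Usage with the first example line:
printf '%s\n' 'MOUSE_10 XC:Z:TGGTCGGCGCGT RG:Z:A XM:Z:GAGTCCGT ZP:i:33' | extract_tags
```

Because the loop inspects every field, the order of XC and XM in the input should not matter. A line that lacks one of the tags would print an empty string in that position.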
The answers to "awk extract multiple groups from each line" come awfully close to it, but whenever I try using match(...) I get a "syntax error near unexpected token" message.
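For reference, a minimal sketch of the two-argument form of match() in POSIX awk, which sets RSTART and RLENGTH instead of filling an array (the three-argument array form is a gawk extension). The whole awk program is wrapped in single quotes so that the shell does not try to interpret the parentheses itself. The function name `extract_with_match` is hypothetical, and the sketch assumes the values contain no spaces or tabs.

```shell
# Hypothetical helper: same output format as requested, using match().
extract_with_match() {
  awk '{
    xc = ""; xm = ""
    # match() returns the start position (0 if not found) and sets RSTART/RLENGTH
    if (match($0, /XC:Z:[^ \t]*/)) xc = substr($0, RSTART + 5, RLENGTH - 5)
    if (match($0, /XM:Z:[^ \t]*/)) xm = substr($0, RSTART + 5, RLENGTH - 5)
    print xc ", " xm
  }'
}

# Usage with the last example line:
printf '%s\n' 'MOUSE_10 XC:Z:CAAATTTGGAAA RG:Z:A NM:i:1 XM:Z:GCAGATAG' | extract_with_match
```

Note that this pattern is not anchored to the start of a field, so it would also match a tag name that merely ends in "XC:Z:" or "XM:Z:".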
Looking forward to your solutions!
Thanks,
Felix