Filter column with awk and regexp

Question

I have a pretty simple question. I have a file containing several columns and I want to filter them using awk.

The column of interest is the 6th column, and I want to find every string made up of:

starting with a number from 1 to 100
after that one "S" or a "M"
again a number from 1 to 100
after that one "S" or a "M"

For example: 20S50M is OK

I tried :

awk '{ if($6 == '/[1-100][S|M][1-100][S|M]/') print} file.txt

but it didn't work... What am I doing wrong?

Chris Seymour · Accepted Answer · 2013-09-23T15:15:29.097

This should do the trick:

awk '$6~/^(([1-9]|[1-9][0-9]|100)[SM]){2}$/' file

Regexplanation:

^                        # Match the start of the string
(([1-9]|[1-9][0-9]|100)  # Match a single digit 1-9 or double digit 10-99 or 100
[SM]                     # Character class matching the character S or M
){2}                     # Repeat everything in the parens twice
$                        # Match the end of the string

You have quite a few issues with your statement:

awk '{ if($6 == '/[1-100][S|M][1-100][S|M]/') print} file.txt

== is the string comparison operator. The regex match operator is ~.
You don't quote regex literals (you never quote anything with single quotes in awk besides the script itself), and your script is missing the final (legal) single quote.
[0-9] is the character class for the digit characters; it is not a numeric range. It matches any single character in the class 0,1,2,3,4,5,6,7,8,9, not any numerical value inside a range. So [1-100] is not the regular expression for numbers in the range 1 - 100; it matches a single character that is either a 1 or a 0.
[SM] is equivalent to (S|M). What you tried, [S|M], is the same as (S|\||M). You don't need the OR operator in a character class.
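
The two character-class points above are easy to check. Here is a minimal sketch in Python, used purely for illustration (it is not part of the awk solution, and the variable names are made up). For these two bracket expressions, Python's `re` module behaves the same way as the description above:

```python
import re

# A bracket expression is a set of single characters, not a numeric range.
# "[1-100]" is the range 1-1 plus two literal zeros, i.e. the set {"0", "1"}.
digit_class = re.compile(r"[1-100]")
matched_digits = [d for d in "0123456789" if digit_class.fullmatch(d)]
print(matched_digits)  # prints ['0', '1']

# Inside brackets "|" is a literal character, so "[S|M]" also accepts "|".
pipe_class = re.compile(r"[S|M]")
print(bool(pipe_class.fullmatch("|")))     # prints True
print(bool(re.fullmatch(r"[SM]", "|")))    # prints False
```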

Awk uses the structure condition{action}. If the condition is true, the actions in the following block {} are executed for the current record being read. The condition in my solution is $6~/^(([1-9]|[1-9][0-9]|100)[SM]){2}$/, which can be read as "does the sixth column match the regular expression?". If it does, the line gets printed, because if you don't give any action awk executes {print $0} by default.
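
For readers who want to see the same condition-then-print logic spelled out step by step, here is a rough Python equivalent. It is only a sketch under some assumptions: fields are separated by whitespace as in awk's default mode, and `FIELD_PATTERN`, `filter_lines` and the sample rows are hypothetical names and data, not something from the question.

```python
import re

# Same pattern as the awk condition; fullmatch() plays the role of ^ and $.
FIELD_PATTERN = re.compile(r"(?:(?:[1-9]|[1-9][0-9]|100)[SM]){2}")


def filter_lines(lines):
    """Keep lines whose 6th whitespace-separated field matches the pattern."""
    kept = []
    for line in lines:
        fields = line.split()  # splits on runs of blanks, like awk's default FS
        if len(fields) >= 6 and FIELD_PATTERN.fullmatch(fields[5]):
            kept.append(line)  # the implicit {print $0}
    return kept


sample = [
    "a b c d e 20S50M",   # valid
    "a b c d e 100M1S",   # valid, boundary values
    "a b c d e 0S50M",    # 0 is out of range
    "a b c d e 101S50M",  # 101 is out of range
    "a b c d e 20S50Mx",  # trailing junk
]
for line in filter_lines(sample):
    print(line)
```

Only the first two sample rows are printed; the other three fail on the range or on the end anchor.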

Thanks a lot, that's great! I have one last problem. I want to add this to a bash script but the output is empty. When I try the command in a shell, it works well. I use the awk command in a pipe with the output of another program. command | awk '$6~/^(|[0-3][ID]){2}(([7-9]|[1-9][0-9]|100)[SM])(|[0-3][ID]){2}(([7-9]|[1-9][0-9]|100)[SM])(|[0-3][ID]){2}$/' > out.txt – Nicolas Rosewick, Sep 24 '13 at 09:40
There is absolutely no reason why that awk script would behave differently in a shell script vs on the command line (assuming the same shell in both). Chances are there's a bug earlier in your shell script. Update your question to show a copy/paste of what you are doing on the command line and with the shell script, including the contents of your shell script, so we can help you identify the problem. – Ed Morton, Sep 24 '13 at 14:07
+1 for the concept of "regexplanation". I know it's thread necromancy, but I'm deep in a regex hole at the moment and it made me smile. :) — Jenn D., Oct 19 '18 at 17:42

score 2 · Answer 2 · answered Sep 23 '13 at 14:42

Regexes cannot check for numeric values. "A number from 1 to 100" is outside what regexes can do. What you can do is check for "1-3 digits."

You want something like this

/\d{1,3}[SM]\d{1,3}[SM]/

Note that the character class [SM] doesn't have the | alternation character. You would only need that if you were writing it as (S|M).

Andy Lester

`"A number from 1 to 100" is outside what regexes can do` as a single character class you cannot, using regex you certainly can. – Chris Seymour Sep 23 '13 at 14:43
What you did was not checking the numeric value. Your answer looks for a 1-digit number, or a 2-digit number, or a literal 100. That isn't checking numeric value. It just fakes it. – Andy Lester Sep 23 '13 at 14:44
My answer uses a regular expression to validate digits in the range 1 - 100. I clearly state in my comment that it cannot be achieved with a single character class, and I explain the difference between character classes and numeric ranges in my answer. Your solution isn't anchored, allows 0 and values over 100, and doesn't compare against the 6th field either. – Chris Seymour Sep 23 '13 at 14:54
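
The objections in the last comment (no anchors, 0 and values over 100 accepted) can be reproduced with a short test. This is a sketch in Python rather than awk, partly because `\d` is a Perl-style shorthand that is not part of POSIX extended regular expressions. The names `loose_ok` and `strict_ok` are invented for this example.

```python
import re

LOOSE = re.compile(r"\d{1,3}[SM]\d{1,3}[SM]")               # unanchored, any 1-3 digits
STRICT = re.compile(r"^(([1-9]|[1-9][0-9]|100)[SM]){2}$")   # anchored, limited to 1-100


def loose_ok(value):
    return LOOSE.search(value) is not None


def strict_ok(value):
    return STRICT.search(value) is not None


for value in ["20S50M", "0S999M", "xx20S50Myy", "1234S5M"]:
    print(value, loose_ok(value), strict_ok(value))
```

The loose pattern accepts all four values, because it can match anywhere inside the string; the anchored pattern accepts only `20S50M`.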

score 2 · Answer 3 · answered Sep 23 '13 at 16:21

I would do the regex check and the numeric validation as different steps. This code works with GNU awk:

$ cat data
a b c d e 132x123y
a b c d e 123S12M
a b c d e 12S23M
a b c d e 12S23Mx

We'd expect only the 3rd line to pass validation

$ gawk '
    match($6, /^([[:digit:]]{1,3})[SM]([[:digit:]]{1,3})[SM]$/, m) && 
    1 <= m[1] && m[1] <= 100 && 
    1 <= m[2] && m[2] <= 100 {
        print
    }
' data
a b c d e 12S23M

For maintainability, you could encapsulate that into a function:

gawk '
    function validate6() {
        return( match($6, /^([[:digit:]]{1,3})[SM]([[:digit:]]{1,3})[SM]$/, m) && 
                1<=m[1] && m[1]<=100 && 
                1<=m[2] && m[2]<=100 );
    }
    validate6() {print}
' data
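
The same two-step idea (check the shape with a regex, then check the captured numbers as integers) carries over to other languages. Below is a minimal Python sketch of it, not a translation the answer's author provided; `SHAPE`, `validate` and the `low`/`high` parameters are hypothetical names chosen for illustration.

```python
import re

# Step 1: shape only. One to three digits, S or M, twice over.
SHAPE = re.compile(r"([0-9]{1,3})[SM]([0-9]{1,3})[SM]")


def validate(field, low=1, high=100):
    """Return True if the field has the right shape and both numbers are in range."""
    m = SHAPE.fullmatch(field)
    if m is None:
        return False
    # Step 2: numeric validation on the captured groups.
    return all(low <= int(group) <= high for group in m.groups())


for field in ["132x123y", "123S12M", "12S23M", "12S23Mx"]:
    print(field, validate(field))
```

With the sample data from this answer, only `12S23M` validates. Changing the allowed range then means editing two numbers rather than rewriting the regex.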

+1 for the only easily extendable solution so far if the OP means something other than `positive integer` when he says `number`! — Ed Morton, Sep 23 '13 at 19:17

score 1 · Answer 4 · answered Sep 23 '13 at 16:28

The way to write the script you posted:

awk '{ if($6 == '/[1-100][S|M][1-100][S|M]/') print} file.txt

in awk so it will do what you SEEM to be trying to do is:

awk '$6 ~ /^(([1-9][0-9]?|100)[SM]){2}$/' file.txt

Post some sample input and expected output to help us help you more.

score 0 · Answer 5 · edited May 17 '20 at 14:25

Try this:

awk '$6 ~/^([1-9]|0[1-9]|[1-9][0-9]|100)+[S|M]+([1-9]|0[1-9]|[1-9][0-9]|100)+[S|M]$/' file.txt

Because you did not say exactly how column 6 will be formatted, the above will work where the column looks like '03M05S', '40S100M', or '3M5S', and exclude all else. For instance, it will not find '03F05S', '200M05S', '03M005S', '003M05S', or '003M005S'.

If you can keep the digits in column 6 to two when 0-99, or three when exactly 100 - meaning exactly one leading zero when under 10, and no leading zeros otherwise, then it is a simpler match. You can use the above pattern but exclude single digits (remove the first [1-9] condition), e.g.

awk '$6 ~/^(0[1-9]|[1-9][0-9]|100)+[S|M]+(0[1-9]|[1-9][0-9]|100)+[S|M]$/' file.txt

edited May 17 '20 at 14:25

agc

answered Sep 23 '13 at 18:20

Andrew

`[S|M]` means `either of the letters "S", "|", or "M"`. There's a few briefer REs already posted that do the job the OP seems to want done. – Ed Morton Sep 23 '13 at 18:31
Ed - I answered the question correctly. I used your answer (which is a copy of Sudo_O) and got no output. The question is not only about regexp, much more importantly, it must actually generate output with awk to answer NicoBxl question. – Andrew Sep 23 '13 at 18:41
My answer is not a copy of @sudo_O's (read them again) and if you got no output then either your input is wrong or your awk doesn't support RE intervals, in which case get a newer awk. Your answer is incorrect because it will match strings that are not in the desired format - excluding similar but invalid strings is always much harder to get correct when writing REs than simply matching the desired strings. Try it with a $6 value of `12|23|` or even `12345678|98647329|` in the input file. – Ed Morton Sep 23 '13 at 18:45
I am on CentOS 6.4. Sorry if tools provided on that OS aren't new enough to support your answer. Again, I am trying to help the questioner solve a real world problem. Looking like an 'expert' or earning SO points isn't my goal here. Did you create a test input file? I did. I ran your expression and it doesn't work using awk on CentOS 6.4 - not without more effort for the poster (you) to help solve. – Andrew Sep 23 '13 at 18:57
Again, if you can't get my solution to work and you are sure that your input is correct, then you are using an old and/or broken version of awk that is not even POSIX compliant. Seriously, get a new one and save yourself more headaches down the road. If it's an old version of gawk then for now you could add the `--re-interval` option. You posted a solution that does not work. I pointed out one of the problems with it. Don't get so defensive. – Ed Morton Sep 23 '13 at 19:02
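
The counterexamples mentioned in this comment thread (`12|23|` and `12345678|98647329|`) can be checked without depending on any particular awk version. Here is a small Python sketch that applies the pattern from this answer unchanged; the helper name `matches` is made up for the example.

```python
import re

# The pattern from this answer, copied as-is (including "[S|M]" and the "+" quantifiers).
ANSWER_PATTERN = re.compile(
    r"^([1-9]|0[1-9]|[1-9][0-9]|100)+[S|M]+([1-9]|0[1-9]|[1-9][0-9]|100)+[S|M]$"
)


def matches(value):
    return ANSWER_PATTERN.search(value) is not None


for value in ["03M05S", "40S100M", "12|23|", "12345678|98647329|", "200M05S"]:
    print(value, matches(value))
```

The first four values match and `200M05S` does not. The two strings containing `|` get through because `|` is a literal inside brackets, and the `+` after each number group lets several numbers run together.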

Joyce Quach · Answer 6 · 2019-08-01T01:16:53.743

I know this thread has already been answered, but I actually have a similar problem (relating to finding strings that "consume query"). I'm trying to sum up all of the integers preceding a character like 'S', 'M', 'I', '=', 'X', 'H', so as to find the read length via a paired-end read's CIGAR string.

I wrote a Python script that takes in the column $6 from a SAM/BAM file:

import sys                      # getting standard input
import re                       # regular expression module

lines = sys.stdin.readlines()   # gets all CIGAR strings for each paired-end read
total = 0
read_id = 1                     # complements id from filter_1.txt

# Get an int array of all the ints matching the pattern 101M, 1S, 70X, etc.
# Example inputs and outputs: 
# "49M1S" produces total=50
# "10M757N40M" produces total=50

for line in lines:
    all_ints = map(int, re.findall(r'(\d+)[SMI=XH]', line))
    for n in all_ints:
        total += n
    print(str(read_id)+ ' ' + str(total))
    read_id += 1
    total = 0

The purpose of the read_id is to mark each read you're going through as "unique", in case you want to take the read lengths and print them beside awk-ed columns from a BAM file.

I hope this helps, or at least helps the next user that has a similar issue. I consulted https://stackoverflow.com/a/11339230 for reference.
