awk extract multiple groups from each line

Question

How do I perform an action on all matching groups when the pattern matches multiple times in a line?

To illustrate, I want to search for /Hello! (\d+)/ and use the numbers, for example to print them out or sum them. So for the input

abcHello! 200 300 Hello! Hello! 400z3
ads
Hello! 0

If I decided to print them out, I'd expect the output of

200
400
0

For Googlers: note that with `gawk`, aka. "GNU awk", you can actually do what the title says (not the question) in one line (e.g. via. piping): `| gawk -v RS='' '{ print gensub(/()()/, "\\1\\2", "g"); }'` :D This supports multi-line (due to the `-v RS=''`) and matching sub-groups (due to using gawk's `gensub`)!!! — Andrew, Sep 13 '17 at 19:18

score 13 · Accepted Answer · answered Jul 13 '09 at 09:54

This uses only simple, portable syntax, so every awk (nawk, mawk, gawk, etc.) can run it.

{
    while (match($0, /Hello! [0-9]+/)) {
        pattern = substr($0, RSTART, RLENGTH);
        sub(/Hello! /, "", pattern);
        print pattern;
        $0 = substr($0, RSTART + RLENGTH);
    }
}
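
The question also mentions summing the numbers. As a rough sketch of how the same loop could do that (the variable names `line` and `total` are my own, and the offset 7 assumes the fixed 7-character prefix "Hello! "), run against the sample input from the question:

```shell
out=$(printf '%s\n' 'abcHello! 200 300 Hello! Hello! 400z3' 'ads' 'Hello! 0' |
awk '{
    line = $0    # work on a copy so $0 and the fields stay intact
    while (match(line, /Hello! [0-9]+/)) {
        # skip the 7-character "Hello! " prefix and add the digits that follow
        total += substr(line, RSTART + 7, RLENGTH - 7)
        # continue searching after the end of this match
        line = substr(line, RSTART + RLENGTH)
    }
}
END { print total + 0 }')    # "+ 0" prints 0 instead of an empty line when nothing matched
printf '%s\n' "$out"
```

For the sample input this should print 600 (200 + 400 + 0). Copying `$0` into `line` is a design choice, not a requirement: it avoids re-splitting the fields on every iteration and leaves the record usable for later rules.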

score 2 · Answer 2 · answered Jul 12 '09 at 16:20

This is gawk syntax. It also works for patterns that have no fixed text usable as a record separator, and it doesn't match across linefeeds:

 {
     pattern = "([a-g]+|[h-z]+)"
     while (match($0, pattern, arr))
     {
         val = arr[1]
         print val
         sub(pattern, "")
     }
 }

Adrian Panasiuk

That `sub` at the end makes a huge difference! Sadly it took me some time to try it out... Thanks! – Gustavo Vargas Oct 06 '18 at 16:19

score 1 · Answer 3 · answered Jul 12 '09 at 15:31

GNU awk

awk 'BEGIN{ RS="Hello! ";}
{
    gsub(/[^0-9].*/,"",$1)
    if ($1 != ""){ 
        print $1 
    }
}' file

ghostdog74

Nice, but won't work for more complex patterns like /([a-g]+|[h-z]+)/ and will match over a linefeed. – Adrian Panasiuk Jul 12 '09 at 16:18

CsTamas · Answer 4 · 2009-07-27T07:23:27.373

gawk has no function that matches the same pattern multiple times in a line, unless you know exactly how many times the pattern repeats.

So you have to iterate over all matches in the same line "manually". For your example input, that would be:

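{
  from = 1    # 1-based offset in $0 where the next search starts
  pos = match( $0, /Hello! ([0-9]+)/, val )
  while( 0 < pos )
  {
    print val[1]
    # pos is relative to the searched substring, so subtract 1 to get an offset in $0
    from += pos + val[0, "length"] - 1
    pos = match( substr( $0, from ), /Hello! ([0-9]+)/, val )
  }
}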

If the pattern should match across a linefeed, you have to modify the input record separator, RS.
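
The answer does not say which RS value to use. One possible choice, shown here only as a sketch, is paragraph mode (`RS = ""`), where each blank-line-separated block becomes a single record with its newlines preserved. The sample input, the variable names `rest` and `hit`, and the `[ \n]+` separator in the pattern are my assumptions for illustration, and the loop is the portable `match`/`substr` one rather than gawk's three-argument `match`:

```shell
out=$(printf 'Hello!\n200 and Hello! 300\n' |
awk 'BEGIN { RS = "" }    # paragraph mode: each blank-line-separated block is one record
{
    rest = $0
    # allow spaces or newlines between "Hello!" and the digits
    while (match(rest, /Hello![ \n]+[0-9]+/)) {
        hit = substr(rest, RSTART, RLENGTH)
        sub(/^Hello![ \n]+/, "", hit)    # strip the prefix, keep only the digits
        print hit
        rest = substr(rest, RSTART + RLENGTH)
    }
}')
printf '%s\n' "$out"
```

Here the first match spans the line break ("Hello!" on one line, "200" on the next). Note that paragraph mode only joins lines within a block; a match still cannot cross a blank line.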
