298

If I have an awk command

pattern { ... }

and pattern uses a capturing group, how can I access the string so captured in the block?

ira wati
rampion
  • 3
    http://stackoverflow.com/questions/1555173/gnu-awk-accessing-captured-groups-in-replacement-text – lt1776 Jan 12 '11 at 18:12
  • 1
    Sometimes (in simple cases) it's possible to adjust the field separator (`FS`) and pick what one would like to match with a `$field`. Preformatting the input could help too. – Krzysztof Jabłoński Jul 01 '15 at 17:06
  • 1
    There is a [better answer](http://stackoverflow.com/a/10254791/894885) on the duplicate question. – Samuel Edwin Ward Jul 08 '15 at 16:04
  • 4
    Samuel Edwin Ward: That's a nice answer too! But it also requires `gawk` (since it uses `gensub`). – rampion Jul 08 '15 at 17:39
  • Needless to say, if you're doing a simple transform, sed handles capture groups quite naturally. – Rob Aug 30 '21 at 00:13

7 Answers

423

With gawk, you can use the match function to capture parenthesized groups.

gawk 'match($0, pattern, ary) {print ary[1]}' 

example:

echo "abcdef" | gawk 'match($0, /b(.*)e/, a) {print a[1]}' 

outputs cd.

Note that this requires gawk specifically: the optional third (array) argument to match() is a GNU extension.

For a portable alternative, you can achieve similar results with match() and substr().

example:

echo "abcdef" | awk 'match($0, /b[^e]*/) {print substr($0, RSTART+1, RLENGTH-1)}'

outputs cd.
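The portable form extends to any case where the group is bounded by literal text of known length. The following is a minimal sketch, assuming a made-up input format with an `id=` field; `extract_id` is a hypothetical helper name, not part of awk.

```shell
# Hypothetical helper: print the digits that follow "id=" on each input line.
# RSTART/RLENGTH describe the whole match, so the 3-character prefix "id="
# is skipped by hand to recover what a capture group would have held.
extract_id() {
    awk 'match($0, /id=[0-9]+/) { print substr($0, RSTART + 3, RLENGTH - 3) }'
}

echo 'user=bob id=42 ok' | extract_id
```

Only POSIX awk features are used here, so it should behave the same under gawk, mawk, and BusyBox awk.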

Thor
glenn jackman
224

That was a stroll down memory lane...

I replaced awk with perl a long time ago.

Apparently the AWK regular expression engine does not capture its groups.

You might consider using something like:

perl -n -e'/test(\d+)/ && print $1'

The -n flag causes perl to loop over every input line, as awk does.
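As written, `print $1` emits no newline, so captures from successive lines run together. A small sketch with `"\n"` appended, using made-up sample input and a hypothetical wrapper name `capture_test_digits`:

```shell
# Hypothetical wrapper around the one-liner above; "\n" is added so that
# each captured number is printed on its own line.
capture_test_digits() {
    perl -n -e '/test(\d+)/ && print "$1\n"'
}

# Made-up sample input: two matching lines and one that does not match.
printf 'test42\nnope\ntest7\n' | capture_test_digits
```

Lines that do not match are skipped, because the `&&` short-circuits before `print` runs.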

Peter Tillemans
  • 4
    Apparently someone disagrees. This web page is from 2005 : http://www.tek-tips.com/faqs.cfm?fid=5674 It confirms that you cannot reuse matched groups in awk. – Peter Tillemans Jun 02 '10 at 13:00
  • [this article](http://www.catonmat.net/blog/ten-awk-tips-tricks-and-pitfalls/) seems to agree with you too. – rampion Jun 02 '10 at 13:10
  • 1
    As the tek-tips article states, gawk can re-use capture groups. – Dennis Williamson Jun 02 '10 at 14:00
  • 5
    I prefer 'perl -n -p -e...' over awk for almost all use cases, since it is more flexible, more powerful and has a saner syntax in my opinion. – Peter Tillemans Jun 23 '11 at 18:39
  • 21
    `gawk` != `awk`. They're different tools and `gawk` isn't available by default in most places. – Oli Sep 04 '12 at 12:21
  • Thanks for the syntax. `&&` and `;` made great differences!! – leesei May 21 '15 at 16:52
  • 11
    The OP specifically asked for an awk solution, so I don't think this is an answer. – Joppe Feb 22 '16 at 16:22
  • 15
    @Joppe you can't give an awk solution if there is no solution. In line 3 I explain that AWK does not support capturing groups and I gave an alternative, which the OP apparently appreciated because this answer was accepted. How could I better answer this question? – Peter Tillemans Mar 09 '16 at 07:54
  • @famousgarkin I keep forgetting Perl for the same reasons I still use grep and/or cut instead of awk: I build up long commands incrementally. And I sometimes have some vague idea that it matters that Perl is larger than grep/awk. – android.weasel Oct 20 '17 at 14:02
  • Be aware that `perl -p`/`-n` internally uses the `<>` operator, that has a "feature" that interprets filenames ending with `|` as commands, and truncates files starting with `>`. See [here](https://www.effectiveperlprogramming.com/2015/05/use-perl-5-22s-operator-for-safe-command-line-handling/). – Socowi Aug 13 '21 at 14:17
  • I think the major issue is that the manual page for `awk` explains for the regular expression syntax that "grouping" is supported, but it says **very little** about *how to use groups*. All it says is: "*Grouping: matches r.*" – U. Windl Jan 24 '22 at 10:48
  • @Joppe, You are factually incorrect; he did answer it: "the AWK regular expression engine does not capture its groups." Whether you agree with the answer or not is irrelevant, as is the fact that the author also added an alternative (in **addition to his answer**), which happens all the time on SO, and is useful for some people. – SO_fix_the_vote_sorting_bug Feb 26 '22 at 15:46
36

This is something I need all the time so I created a bash function for it. It's based on glenn jackman's answer.

Definition

Add this to your .bash_profile etc.

function regex { gawk 'match($0,/'"$1"'/, ary) {print ary['"${2:-0}"']}'; }

Usage

Capture regex for each line in file

$ cat filename | regex '.*'

Capture 1st regex capture group for each line in file

$ cat filename | regex '(.*)' 1
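Splicing the pattern into the awk program text means a `/` or a quote in the regex can break the script. One possible alternative, sketched below, passes the pattern and group number through the environment and reads them via `ENVIRON`. The name `regex_env` and the variable names `RE` and `GRP` are assumptions made for this example, and gawk is still required for the three-argument match().

```shell
# Hypothetical variant of the regex function: the pattern and the group
# number travel through the environment instead of being pasted into the
# program text, so the awk script itself never changes.
regex_env() {
    RE="$1" GRP="${2:-0}" gawk 'match($0, ENVIRON["RE"], ary) { print ary[ENVIRON["GRP"]] }'
}

# Usage (requires gawk):
#   printf 'took 15ms\n' | regex_env '([0-9]+)ms$' 1
```

As with the original function, omitting the second argument selects group 0, which is the whole match.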
opsb
20

You can use GNU awk:

$ cat hta
RewriteCond %{HTTP_HOST} !^www\.mysite\.net$
RewriteRule (.*) http://www.mysite.net/$1 [R=301,L]

$ gawk 'match($0, /.*(http.*?)\$/, m) { print m[1]; }' < hta
http://www.mysite.net/
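The `.*?` in that pattern is a Perl-style non-greedy quantifier, which POSIX extended regular expressions do not define. A sketch of the same extraction that avoids it and also runs in non-GNU awks, using a bracket expression to stop at the first `$`; the helper names `hta_sample` and `extract_url` are made up for this example:

```shell
# Sample input copied from the answer above.
hta_sample() {
cat <<'EOF'
RewriteCond %{HTTP_HOST} !^www\.mysite\.net$
RewriteRule (.*) http://www.mysite.net/$1 [R=301,L]
EOF
}

# Match "http", then everything up to and including the first "$";
# RLENGTH - 1 drops that trailing "$" from the printed text.
extract_url() {
    awk 'match($0, /http[^$]*\$/) { print substr($0, RSTART, RLENGTH - 1) }'
}

hta_sample | extract_url
```

Because match() reports only the overall match, the pattern begins at `http` rather than with a leading `.*`.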
Isvara
  • 5
    That's [what glenn jackman's answer says](http://stackoverflow.com/a/4673336/9859), pretty much. – rampion Nov 29 '12 at 13:02
  • 1
    Ed Morton: that deserves a top-level answer I'd say. edit: uhm... that prints `RewriteRule (.*) http://www.mysite.net/$` for me, which is more than the subgroup. – rampion Nov 29 '12 at 13:02
  • 3
    [Looks like `RSTART` and `RLENGTH` refer to the substring matched by the pattern](http://www.grymoire.com/Unix/Awk.html#uh-47) – rampion Nov 29 '12 at 13:10
  • @EdMorton - no, that will select the whole line that contains `http...` pattern – KFL Dec 24 '20 at 06:54
  • @KFL you're right but actually there's a worse problem that the posted answer (and my suggestion to make it not gawk-specific) both contain `.*?` which is a PCRE-ism and undefined behavior in an ERE. I'll delete my comment. – Ed Morton Dec 24 '20 at 13:41
7

NOTE: the use of gensub is not POSIX compliant

You can simulate capturing without gawk's array form of match() too, although gensub itself is a gawk extension. It's not intuitive though:

Step 1. Use gensub to surround matches with some character that doesn't appear in your string.
Step 2. Use split against that character.
Step 3. Every other element in the resulting array (the even-numbered ones) is your captured text.

$ echo 'ab cb ad' | awk '{ split(gensub(/a./,SUBSEP"&"SUBSEP,"g",$0),cap,SUBSEP); print cap[2]"|" cap[4] ; }'
ab|ad
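The split trick above returns whole matches only. When just part of each match is wanted, such as the value in `key=value` pairs, one POSIX-only option is to loop over match() and substr(). This is a sketch assuming a made-up input format of lowercase keys and numeric values; `extract_values` is a hypothetical helper name.

```shell
# Hypothetical helper: print the value part of every key=value pair on a
# line, joined with "|". Uses only POSIX awk functions.
extract_values() {
    awk '{
        rest = $0
        out = ""
        # Repeatedly match key=value on the not-yet-scanned remainder.
        while (match(rest, /[a-z]+=[0-9]+/)) {
            pair = substr(rest, RSTART, RLENGTH)
            # The text after "=" plays the role of the capture group.
            value = substr(pair, index(pair, "=") + 1)
            out = (out == "" ? value : out "|" value)
            # Continue scanning just past the current match.
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out
    }'
}

echo 'a=1 b=22 c=333' | extract_values
```

Each pass shortens `rest`, so the loop ends once no further pair is found.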
KFL
ydrol
  • 3
  • 5
    I'm almost certain that `gensub` is a `gawk` specific function. What do you get from your awk if you type `awk --version` ;-?). Good luck to all. – shellter Apr 13 '12 at 05:28
  • 8
    I'm fully certain that gensub is a gawk-ism, though BusyBox awk also has it. This answer could also be implemented using gsub, though: `echo 'ab cb ad' | awk '{gsub(/a./,SUBSEP"&"SUBSEP);split($0,cap,SUBSEP);print cap[2]"|"cap[4]}'` – dubiousjim Apr 19 '12 at 01:05
  • 4
    gensub() is a gawk extension, gawk's manual clearly says so. Other awk variants may also implement it, but it is still not POSIX. Try gawk --posix '{gsub(...)}' and it will complain – MestreLion Apr 21 '12 at 05:19
  • 2
    @MestreLion, you mean it will complain for `gawk --posix '{gensub(...)}'`. – dubiousjim Apr 24 '12 at 00:08
  • @dubiousjim: oops, yes, `gensub()`, sorry for the typo – MestreLion Apr 24 '12 at 02:25
  • 2
    Setting aside that you were wrong about **POSIX awk** having the `gensub` function, your example applies only to a very limited scenario: the whole pattern is grouped, so it can't handle something like matching every `key=(value)` when I want to extract only the `value` parts. – Meow Sep 24 '15 at 13:24
  • 2
    Enough people have commented about "gensub is a gawk-ism". Why not edit your answer at least? – Juan May 26 '18 at 01:23
2

I struggled a bit with coming up with a bash function that wraps Peter Tillemans' answer but here's what I came up with:

function regex { perl -n -e "/$1/ && printf \"%s\n\", "'$1'; }

I found this worked better than opsb's awk-based bash function for the following regular expression argument, because I do not want the "ms" to be printed.

'([0-9]*)ms$'
wytten
  • I prefer this solution, since you can see the parts of the group that delimit the capture, while also omitting them. However, could someone explain how this works? I can't get this perl syntax to work properly in BASH, because I don't understand it very well - especially the double/single-quote marks around `$1` – Demis Dec 19 '17 at 18:39
  • It is not something I have done before or since, but looking back what it is doing is concatenating two strings, the first string being in double quotes (this first string contains embedded double quotes escaped with backslash) and the second string being in single quotes. Then the result of that concatenation is supplied as argument to perl -e. Also you need to know that the first $1 (the one within double quotes) is substituted with the first argument to the function, while the second $1 (the one within single quotes) is left untouched. See [this example](https://i.imgur.com/Bfp2TmA.png) – wytten Dec 19 '17 at 23:01
  • I see, that's making a bit more sense now. So where in the perl command is the regex match/group capture definition? I see you wrote `'([0-9]*)ms$'` - is that supplied as an argument (and the string another argument)? And the output from `perl -e` is being inserted into bash's `printf` command then, to replace `%s`, is that right? Thanks, I am hoping to use this. – Demis Dec 20 '17 at 23:55
  • 1
    You pass a regular expression enclosed in single quotes as the sole argument to the regex bash function. [Example](https://i.imgur.com/71UKj52.png) – wytten Dec 21 '17 at 13:51
  • I downvoted because the question asks about awk so this answer is irrelevant. – bfontaine Dec 13 '21 at 21:33
0

I think gawk's match()-to-array form captures only the first instance of the capture group.

If there are multiple things you'd like to capture and then perform complex operations on, perhaps:

gawk 'BEGIN { S = SUBSEP }
      {
          # Rewrite each match as its groups joined by S, then split on S.
          nx = split(gensub(/(..(..)..(..))/,
                            "\\1" S "\\2" S "\\3", "g", $0),
                     arr, S)
          # nx is the element count, so iterate by index.
          for (x = 1; x <= nx; x++) { print arr[x] }   # replace print with your own operations
      }'

This way you aren't constrained by either gensub(), which limits the complexity of your modifications, or by match().

By pure trial and error, one caveat I've noted about gawk in Unicode mode, for a valid Unicode string 뀇꿬 with the 6 octal codes listed below:

Scenario 1: matching individual bytes is fine, but match() reports the multi-byte RSTART of 1 instead of a byte-level answer of 2. It also won't tell you whether \207 is the first continuation byte or the second one, since RLENGTH will always be 1 here.

$ gawk 'BEGIN{ print match("\353\200\207\352\277\254", "\207") }'
1

Scenario 2 : Match also works against unicode-invalid patterns like this

$ gawk 'BEGIN{ match("\353\200\207\352\277\254", "\207\352");
               print RSTART, RLENGTH }'
1 2

Scenario 3 : you can check for existence of a pattern against a unicode-illegal string (\300 \xC0 is UTF8-invalid for all possible byte pairings)

$ gawk 'BEGIN{ print ("\300\353\200\207\352\277\254" ~ /\200/) }'
1

Scenarios 4/5/6: the warning shows up for either (a) match() with a Unicode-invalid string, or (b) index() when either argument is Unicode-invalid or incomplete.

$ gawk 'BEGIN{ match("\300\353\200\207\352\277\254", "\207\352"); print RSTART, RLENGTH }'
gawk: cmd. line:1: warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale.
2 2

$ gawk 'BEGIN{ print index("\353\200\207\352\277\254", "\352") }'
gawk: cmd. line:1: warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale.
0

$ gawk 'BEGIN{ print index("\353\200\207\352\277\254", "\200") }'
gawk: cmd. line:1: warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale.
0
RARE Kpop Manifesto