Capturing Groups From a Grep RegEx

Question

I've got this script in sh (macOS 10.6) to look through an array of files:

files="*.jpg"
for f in $files
    do
        echo $f | grep -oEi '[0-9]+_([a-z]+)_[0-9a-z]*'
        name=$?
        echo $name
    done

So far $name merely holds 0, 1 or 2, depending on whether grep found that the filename matched the pattern provided. What I'd like is to capture what's inside the parens ([a-z]+) and store that in a variable.

I'd like to use grep only, if possible. If not, please no Python or Perl, etc.; sed or something like it is fine. I would like to attack this from the *nix purist angle.

Ah, didn't mean to suggest that. I was just hoping that a solution could be found using a tool I'm specifically trying to learn here. If it's not possible to solve using `grep`, then `sed` would be great, if it's possible to solve using `sed`. – Isaac Dec 12 '09 at 01:09
@martinclayton That'd be an interesting argument. I do really think sed (or ed, to be precise) would be older (and therefore purer? maybe?) Unix, because grep derives its name from the ed expression g(lobal)/re(gular expression)/p(rint). – ffledgling Mar 05 '13 at 15:18

Dennis Williamson · Accepted Answer · 2018-11-22T14:01:35.640

678

If you're using Bash, you don't even have to use grep:

files="*.jpg"
regex="[0-9]+_([a-z]+)_[0-9a-z]*"
for f in $files    # unquoted in order to allow the glob to expand
do
    if [[ $f =~ $regex ]]
    then
        name="${BASH_REMATCH[1]}"
        echo "${name}.jpg"    # concatenate strings
        name="${name}.jpg"    # same thing stored in a variable
    else
        echo "$f doesn't match" >&2 # this could get noisy if there are a lot of non-matching files
    fi
done

It's better to put the regex in a variable. Some patterns won't work if included literally.

This uses =~ which is Bash's regex match operator. The results of the match are saved to an array called $BASH_REMATCH. The first capture group is stored in index 1, the second (if any) in index 2, etc. Index zero is the full match.
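As a small illustration of how the indices line up, here is a sketch that matches a made-up name against a regex with two capture groups. The sample name and the variable names (sample, re, whole, first, second) are assumptions for illustration only, and it needs Bash rather than plain sh.

```bash
# Hypothetical sample name; not one of the asker's files.
sample="123_abc_d4e5.jpg"
re='([0-9]+)_([a-z]+)_'           # two capture groups, kept in a variable

if [[ $sample =~ $re ]]; then
    whole="${BASH_REMATCH[0]}"    # index 0: the entire matched text
    first="${BASH_REMATCH[1]}"    # index 1: first group (the digits)
    second="${BASH_REMATCH[2]}"   # index 2: second group (the letters)
    printf '%s\n' "$whole" "$first" "$second"
fi
```

For this sample the three lines printed should be 123_abc_, 123 and abc.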

You should be aware that without anchors, this regex (and the one using grep) will match any of the following examples and more, which may not be what you're looking for:

123_abc_d4e5
xyz123_abc_d4e5
123_abc_d4e5.xyz
xyz123_abc_d4e5.xyz

To eliminate the second and fourth examples, make your regex like this:

^[0-9]+_([a-z]+)_[0-9a-z]*

which says the string must start with one or more digits. The caret represents the beginning of the string. If you add a dollar sign at the end of the regex, like this:

^[0-9]+_([a-z]+)_[0-9a-z]*$

then the third example will also be eliminated since the dot is not among the characters in the regex and the dollar sign represents the end of the string. Note that the fourth example fails this match as well.
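To check the effect of the anchors, here is a rough sketch that counts how many of the four example strings each variant of the regex accepts. It uses grep -E rather than =~ so it also runs in plain sh; the variable names are made up for the illustration.

```sh
# The four example strings from above, one per line.
names='123_abc_d4e5
xyz123_abc_d4e5
123_abc_d4e5.xyz
xyz123_abc_d4e5.xyz'

# grep -c prints the number of input lines that match the pattern.
unanchored=$(printf '%s\n' "$names" | grep -cE '[0-9]+_([a-z]+)_[0-9a-z]*')
start_only=$(printf '%s\n' "$names" | grep -cE '^[0-9]+_([a-z]+)_[0-9a-z]*')
both_ends=$(printf '%s\n' "$names" | grep -cE '^[0-9]+_([a-z]+)_[0-9a-z]*$')

echo "unanchored=$unanchored start_only=$start_only both_ends=$both_ends"
```

This should report 4, 2 and 1: every string passes without anchors, only the two that begin with digits pass with the caret, and only the first passes with both anchors.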

If you have GNU grep (around 2.5 or later, I think, when the \K operator was added):

name=$(echo "$f" | grep -Po '(?i)[0-9]+_\K[a-z]+(?=_[0-9a-z]*)').jpg

The \K operator (variable-length look-behind) causes the preceding pattern to match, but doesn't include the match in the result. The fixed-length equivalent is (?<=) - the pattern would be included before the closing parenthesis. You must use \K if quantifiers may match strings of different lengths (e.g. +, *, {2,4}).

The (?=) operator matches fixed or variable-length patterns and is called "look-ahead". It also does not include the matched string in the result.

In order to make the match case-insensitive, the (?i) operator is used. It affects the patterns that follow it so its position is significant.
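A quick way to see the three operators side by side is the sketch below. It assumes a grep built with PCRE support (the -P option); the upper-case sample name and the variable names are made up for the illustration.

```sh
# Hypothetical sample name with upper-case letters to exercise (?i).
sample="123_ABC_d4e5.jpg"

# \K drops "123_" from the reported match, (?=...) requires but does not
# report the trailing "_d4e5", and (?i) lets [a-z] match "ABC".
with_k=$(printf '%s\n' "$sample" | grep -Po '(?i)[0-9]+_\K[a-z]+(?=_[0-9a-z]*)')

# A fixed-length look-behind can only assert the single "_" before the letters.
with_lookbehind=$(printf '%s\n' "$sample" | grep -Po '(?i)(?<=_)[a-z]+(?=_)')

echo "$with_k $with_lookbehind"
```

Both commands should print ABC for this sample. The look-behind version is looser, though, since it no longer insists on digits before the first underscore.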

The regex might need to be adjusted depending on whether there are other characters in the filename. You'll note that in this case, I show an example of concatenating a string at the same time that the substring is captured.

edited Nov 22 '18 at 14:01

answered Dec 12 '09 at 02:59

Dennis Williamson


In this answer I want to upvote the specific line that says "It's better to put the regex in a variable. Some patterns won't work if included literally." – Brandin Jan 09 '14 at 12:41
"It's better to put the regex in a variable. Some patterns won't work if included literally." - Why does this happen? Is there a way to fix them? – Francesco Frassinelli Oct 12 '14 at 05:47
@FrancescoFrassinelli: An example is a pattern that includes white space. It's awkward to escape and you can't use quotes since that forces it from a regex to an ordinary string. The correct way to do it is to use a variable. Quotes can be used during the assignment making things much simpler. – Dennis Williamson Oct 12 '14 at 08:03
However, Bash regex doesn't support lazy matching. – jowie Sep 14 '15 at 08:41
\K operator is a lifesaver, when all you wanted was a quick one-liner for your directory operation. (my specific case was look for a name in a file, and make some directories as a result). so `grep -P -o "blah \K([stuffhere]+)" somefile | godosomethingwiththat` – smaudet Feb 09 '16 at 17:39
This simply doesn't work. No regex matches regardless of the regex or input string – Brandon Mar 14 '16 at 20:02
@Brandon: It does work. What version of Bash are you using? Show me what you're doing that doesn't work and perhaps I can tell you why. – Dennis Williamson Mar 14 '16 at 20:12
@DennisWilliamson `4.3.11(1)-release`. I literally copied the example verbatim. `echo "${name}.jpg"` echoes ".jpg" – Brandon Mar 14 '16 at 20:19
@Brandon: Do you have files in the current directory that match the pattern? For example, `touch 012_abc_03a.jpg 345_def_14b.jpg` would create a couple of empty test files that the regex matches. In my answer, the regex match and its output should be part of an `if` statement instead of standing alone in order to avoid outputting empty results. I'll make that change in order to improve clarity. – Dennis Williamson Mar 14 '16 at 20:31
I'm not using files, I've adapted it to my scenario which is a svn log to retrieve text from a commit. I have the regex, it matches using grep but it doesn't match with =~. – Brandon Mar 16 '16 at 02:22
@Brandon: Without a specific example, I can't be of any help. Perhaps you should post the issue as its own question. – Dennis Williamson Mar 16 '16 at 16:18
@Brandon It does work and the example is easy enough to understand. In fact, it's a great answer. You're definitely doing something wrong. – mdelolmo Nov 22 '16 at 10:56
question was about GREP, not BASH – nmz787 Oct 25 '17 at 20:18
@nmz787: My answer includes information about `grep`. It was also accepted by the OP and upvoted quite a lot. Thanks for the downvote. – Dennis Williamson Oct 26 '17 at 16:51
@WylliamJudd You almost certainly have a typo. Which command does it say isn't found? What version of Bash? – Dennis Williamson Oct 26 '18 at 00:33
For example `regex="([A-z]+)\."` `"foo.bar"=~$regex` `-bash: foo.bar=~([A-z]+)\.: command not found` – Wylliam Judd Oct 26 '18 at 04:59
@WylliamJudd: There needs to be spaces around the `=~` and that expression needs to be within double brackets. In brace expansions such as in your second message it's not necessary to escape characters (the dot is always literal - that's a glob, not a regex). – Dennis Williamson Oct 26 '18 at 12:53
I did it both ways. `"foo.bar" =~ $regex` still gives `-bash: foo.bar: command not found`. `[["foo.bar" =~ $regex]]` also gives `-bash: [[foo.bar: command not found` Thanks for the tip on the superfluous escape. – Wylliam Judd Oct 26 '18 at 19:21
@WylliamJudd You _also_ need spaces inside the double brackets just as I show in my answer. – Dennis Williamson Oct 26 '18 at 19:24
Ah! I thought those double braces were the equivalent of parentheses for an if statement, and since I have no if statement in my use case, I didn't realize I needed them. Thanks! :) – Wylliam Judd Oct 26 '18 at 19:26
The part with grep and `\K` operator is great when you are using sh, not bash, for example in a buildroot package makefile. – Filip Kubicz Dec 21 '21 at 06:58
In such case, the BASH_REMATCH fails with `*** missing separator.` and you don't really know why, until you realize the buildsystem might be using sh. – Filip Kubicz Dec 21 '21 at 06:59

score 182 · Answer 2 · answered Dec 12 '09 at 01:26

This isn't really possible with pure grep, at least not generally.

But if your pattern is suitable, you may be able to use grep multiple times within a pipeline to first reduce your line to a known format, and then to extract just the bit you want. (Although tools like cut and sed are far better at this).

Suppose, for the sake of argument, that your pattern were a bit simpler: [0-9]+_([a-z]+)_ (with no trailing part). You could extract the letters like so:

echo $name | grep -Ei '[0-9]+_[a-z]+_' | grep -oEi '[a-z]+'

The first grep removes any lines that don't match your overall pattern; the second grep (which has --only-matching specified) displays the alpha portion of the name. This only works because the pattern is suitable: "alpha portion" is specific enough to pull out what you want.

(Aside: Personally I'd use grep + cut to achieve what you are after: echo $name | grep {pattern} | cut -d _ -f 2. This gets cut to parse the line into fields by splitting on the delimiter _, and returns just field 2 (field numbers start at 1)).
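As a rough sketch of the grep + cut idea from the aside, the helper below filters with grep and then takes field 2 with cut. The function name middle_part and the two sample names are hypothetical, chosen only for the illustration.

```sh
# middle_part is a made-up helper name, not an existing command.
middle_part() {
    # grep is the filter: names that do not fit the pattern produce no output,
    # so cut never sees them. cut then splits on "_" and keeps field 2.
    printf '%s\n' "$1" | grep -Ei '[0-9]+_[a-z]+_' | cut -d _ -f 2
}

matched=$(middle_part 001_abc_0za.jpg)
unmatched=$(middle_part holiday.jpg)
echo "matched='$matched' unmatched='$unmatched'"
```

The grep step matters here: cut on its own passes through lines that contain no delimiter at all (unless -s is given), so without the filter holiday.jpg would come back unchanged instead of empty.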

Unix philosophy is to have tools which do one thing, and do it well, and combine them to achieve non-trivial tasks, so I'd argue that grep + sed etc is a more Unixy way of doing things :-)

answered Dec 12 '09 at 01:26

RobM


`for f in $files; do name=`echo $f | grep -oEi '[0-9]+_([a-z]+)_[0-9a-z]*'| cut -d _ -f 2`;` Aha! – Isaac Dec 12 '09 at 01:43
i disagree with that "philosophy". if you can use the shell's in built capabilities without calling external commands, then your script will be a lot faster in performance. there are some tools that overlap in function. eg grep and sed and awk. all of them does string manipulations, but awk stands out above them all because it can do a lot more. Practically, all those chaining of commands, like the above double greps or grep+sed can be shortened by doing them with one awk process. – ghostdog74 Dec 12 '09 at 04:43
@ghostdog74: No argument here that chaining lots of tiny operations together is generally less efficient than doing it all in one place, but I stand by my assertion that the Unix philosophy is lots of tools working together. For instance, tar just archives files, it doesn't compress them, and because it outputs to STDOUT by default you can pipe it across the network with netcat, or compress it with bzip2, etc. Which to my mind reinforces the convention and general ethos that Unix tools should be able to work together in pipes. – RobM Dec 13 '09 at 14:26
cut is awesome -- thanks for the tip! As for the tools vs efficiency argument, I like the simplicity of chaining tools. – ether_joe Oct 28 '14 at 23:00
props for grep's o option, that is very helpful – chiliNUT Jan 22 '17 at 05:02

score 123 · Answer 3 · answered Mar 03 '13 at 17:14

I realize that an answer was already accepted for this, but from a "strictly *nix purist angle" it seems like the right tool for the job is pcregrep, which doesn't seem to have been mentioned yet. Try changing the lines:

    echo $f | grep -oEi '[0-9]+_([a-z]+)_[0-9a-z]*'
    name=$?

to the following:

    name=$(echo $f | pcregrep -o1 -Ei '[0-9]+_([a-z]+)_[0-9a-z]*')

to get only the contents of the capturing group 1.

The pcregrep tool utilizes all of the same syntax you've already used with grep, but implements the functionality that you need.

The parameter -o works just like the grep version if it is bare, but it also accepts a numeric parameter in pcregrep, which indicates which capturing group you want to show.

With this solution there is a bare minimum of change required in the script. You simply replace one modular utility with another and tweak the parameters.

Interesting Note: You can use multiple -o arguments to return multiple capture groups in the order in which they appear on the line.

answered Mar 03 '13 at 17:14

John Sherwood


`pcregrep` is not available by default in `Mac OS X` which is what the OP uses – grebneke Jan 01 '14 at 02:06
My `pcregrep` doesn't seem to understand the digit after the `-o`: "Unknown option letter '1' in "-o1"". There's also no mention of that functionality when looking at `pcregrep --help` – Peter Herdenborg Mar 25 '15 at 09:10
i can't reproduce it. probably the impl of this `pcregrep` is different. could you provide more info? what about the difference between this and `grep -P`? not even in man page: http://linux.die.net/man/1/pcregrep – Jason Hu Jul 03 '15 at 14:02
@PeterHerdenborg What version are you using? I have the same issue and found reference to it [here](https://access.redhat.com/solutions/536243). – WAF Jul 20 '15 at 15:08
@WAF sorry, guess I should have included that info in my comment. I'm on Centos 6.5 and the pcregrep version is apparently very old: `7.8 2008-09-05`. – Peter Herdenborg Jul 31 '15 at 08:14
yeah, very helpful, e.g. `echo 'r123456 foo 2016-03-17' | pcregrep -o1 'r([0-9]+)'` prints `123456` – zhuguowei Mar 17 '16 at 13:18
On macOS, `brew install pcre`. Also note that Homebrew's zsh depends on pcre, so you may already have pcre if you installed that. – anishpatel Oct 20 '17 at 22:26
`pcregrep` 8.41 (installed with `apt-get install pcregrep` on `Ubuntu 16.03`) doesn't recognize the `-Ei` switch. It works perfectly without it, though. On macOS, with `pcregrep` installed via `homebrew` (also 8.41) as @anishpatel mentions above, at least on High Sierra the `-E` switch is also not recognized. – Ville Feb 11 '18 at 22:56
Another handy option is `--om-separator=text` (only matching separator) which specifies a separating string. For example, `--om-separator=", "` separates matches by commas – Corey Dec 06 '22 at 04:40

cobbal · Answer 4 · 2009-12-12T01:17:33.700

41

Not possible with grep alone, I believe.

For sed:

name=`echo $f | sed -E 's/([0-9]+_([a-z]+)_[0-9a-z]*)|.*/\2/'`

I'll take a stab at the bonus though:

echo "$name.jpg"
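A variation on the sed idea, offered as a sketch rather than as the answer's own method: with -n and the p flag, sed prints only when the substitution succeeded, so non-matching names give empty output instead of a blank line. It assumes a sed that accepts -E (both GNU and BSD sed do); capture_name and the sample names are made up, and unlike grep -i the match is case-sensitive.

```sh
# capture_name is a hypothetical helper name used only for this sketch.
capture_name() {
    # -n suppresses the default output; the trailing p prints the line
    # only if the s/// command actually replaced something.
    printf '%s\n' "$1" | sed -nE 's/^[0-9]+_([a-z]+)_[0-9a-z]*.*$/\1/p'
}

hit=$(capture_name 012_abc_03a.jpg)
miss=$(capture_name notes.txt)
echo "hit='$hit' miss='$miss'"
```

Here there is only one pair of parentheses, so the letters are in \1 rather than \2.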

edited Dec 12 '09 at 01:17

answered Dec 12 '09 at 01:00

cobbal


Unfortunately, that `sed` solution doesn't work. It simply prints out everything in my directory. – Isaac Dec 12 '09 at 01:14
updated, will output a blank line if there isn't a match, so be sure to check for that – cobbal Dec 12 '09 at 01:19
It now outputs only blank lines! – Isaac Dec 12 '09 at 01:24
this sed has a problem. The first group of capturing parenthesis encompass everything. Of course \2 will have nothing. – ghostdog74 Dec 12 '09 at 04:36
it worked for some simple test cases... \2 gets the inner group – cobbal Dec 12 '09 at 06:01
NAILED THE "BONUS" ;) – mgalgs Jan 29 '14 at 20:09

score 20 · Answer 5 · answered Feb 03 '21 at 11:43

str="1w 2d 1h"
regex="([0-9])w ([0-9])d ([0-9])h"
if [[ $str =~ $regex ]]
then
    week="${BASH_REMATCH[1]}"
    day="${BASH_REMATCH[2]}"
    hour="${BASH_REMATCH[3]}"
    echo $week --- $day ---- $hour
fi

output: 1 --- 2 ---- 1

answered Feb 03 '21 at 11:43

chirag nayak


score 19 · Answer 6 · answered Jan 09 '13 at 06:37

This is a solution that uses gawk. It's something I find I need to use often, so I created a function for it:

function regex1 { gawk 'match($0,/'$1'/, ary) {print ary['${2:-'1'}']}'; }

To use it, just do:

$ echo 'hello world' | regex1 'hello\s(.*)'
world

answered Jan 09 '13 at 06:37

opsb


Great idea, but does not seem to work with spaces in the regexp - they need to be replaced with `\s`. Do you know how to fix it? – Adam Ryczkowski Feb 16 '19 at 09:10
@opsb What is the purpose of - `ary['${2:-'1'}']}'`? Couldn't we have done the same thing with - `function regex1 { gawk 'match($0,/'$1'/, ary) {print ary[1]}'; }` to match the 1st occurrence? Edit: Never mind! got the intention. – Anis Jan 04 '23 at 11:33

score 6 · Answer 7 · answered Dec 12 '09 at 01:16

A suggestion for you - you can use parameter expansion to remove the part of the name from the last underscore onwards, and similarly at the start:

f=001_abc_0za.jpg
work=${f%_*}
name=${work#*_}

Then name will have the value abc.

See Apple developer docs, search forward for 'Parameter Expansion'.
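The two expansions do no validation of their own, so here is a sketch that puts a loose shape check in front of them. The case pattern is a glob, not a regex, so it is only an approximation of [0-9]+_([a-z]+)_ and is an assumption added for the illustration.

```sh
f=001_abc_0za.jpg

# Glob check: starts with a digit and contains at least two underscores.
# This is weaker than the regex; it does not verify that the middle part
# consists of letters only.
case $f in
    [0-9]*_*_*)
        work=${f%_*}      # drop the last "_" and everything after it
        name=${work#*_}   # drop everything up to and including the first "_"
        ;;
    *)
        name=             # no match: leave name empty
        ;;
esac
echo "$name"
```

For the sample name this prints abc; a name without two underscores leaves name empty.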

answered Dec 12 '09 at 01:16

martin clayton


this will not check for ([a-z]+). – ghostdog74 Dec 12 '09 at 04:09
@levislevis - that's true, but, as commented by the OP, it does do what was needed. – martin clayton Dec 12 '09 at 05:18

towith · Answer 8 · 2020-08-31T07:25:11.943

4

I prefer a one-line python or perl command; both are often included in major Linux distributions.

echo $'
<a href="http://stackoverflow.com">
</a>
<a href="http://google.com">
</a>
' |  python -c $'
import re
import sys
for i in sys.stdin:
  g=re.match(r\'.*href="(.*)"\',i)
  if g is not None:
    print(g.group(1))
'

and to handle files:

ls *.txt | python -c $'
import sys
import re
for i in sys.stdin:
  i=i.strip()
  f=open(i,"r")
  for j in f:
    g=re.match(r\'.*href="(.*)"\',j)
    if g is not None:
      print(g.group(1))
  f.close()
'

edited Aug 31 '20 at 07:25

answered Aug 25 '20 at 02:50

towith


+1 for the multiline python program, I feel like this is a fairly standard way of doing this on a lot of systems, that's also inline yet much more flexible than standard bash tools. – forumulator Dec 13 '20 at 01:32
Ever try doing this in Vim or Neovim? A well-used list comprehension makes it possible to populate your script with batched sh command lines. – David Golembiowski Sep 05 '22 at 02:53

Stephen Quan · Answer 9 · 2022-03-10T21:16:50.107

4

The following example shows how to extract the three-character sequence from a filename using a regex capture group:

for f in 123_abc_123.jpg 123_xyz_432.jpg
do
    echo "f:    " $f
    name=$( perl -ne 'if (/[0-9]+_([a-z]+)_[0-9a-z]*/) { print $1 . "\n" }' <<< $f )
    echo "name: " $name
done

Outputs:

f:     123_abc_123.jpg
name:  abc
f:     123_xyz_432.jpg
name:  xyz

The if-regex conditional in perl filters out all non-matching lines; for the lines that do match, it applies the capture group(s), which you can access with $1, $2, and so on.

edited Mar 10 '22 at 21:16

answered Jun 15 '21 at 23:54

Stephen Quan


I wish I would have found this a week ago. Works great, thanks!!! – rtremaine Jun 16 '21 at 18:16

ghostdog74 · Answer 10 · 2009-12-12T04:12:25.940

3

If you have bash, you can use extended globbing:

shopt -s extglob
shopt -s nullglob
shopt -s nocaseglob
for file in +([0-9])_+([a-z])_+([a-z0-9]).jpg
do
   IFS="_"
   set -- $file
   echo "This is your captured output : $2"
done

or

ls +([0-9])_+([a-z])_+([a-z0-9]).jpg | while read file
do
   IFS="_"
   set -- $file
   echo "This is your captured output : $2"
done
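For what the IFS="_" and set -- lines are doing: with IFS set to an underscore, the unquoted $file is split on underscores, and set -- stores the pieces as the positional parameters, so $2 is the second piece. The sketch below isolates that step in a helper (second_field is a made-up name) that runs in a subshell, so the caller's IFS is not changed; it works in plain sh as well as bash.

```sh
# second_field is a hypothetical helper name used only for this sketch.
second_field() (
    # The parentheses run the body in a subshell, so the IFS change and
    # the new positional parameters do not leak into the caller.
    IFS=_
    set -f        # turn off globbing while the unquoted $1 is split
    set -- $1     # split on "_": the pieces become $1, $2, $3, ...
    printf '%s\n' "$2"
)

captured=$(second_field 001_abc_0za.jpg)
echo "This is your captured output : $captured"
```

In the loops above, IFS stays set to "_" after the loop ends, which can surprise later commands in the same script; confining the change to a subshell is one way around that.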

edited Dec 12 '09 at 04:12

answered Dec 12 '09 at 04:06

ghostdog74


That looks intriguing. Could you perhaps append a little explanation to it? Or, if you're so inclined, link to a particularly insightful resource that explains it? Thanks! – Isaac Dec 12 '09 at 04:14
