391

Is there a way to tell sed to output only captured groups?

For example, given the input:

This is a sample 123 text and some 987 numbers

And pattern:

/([\d]+)/

Could I get only 123 and 987 as output, formatted using back references?

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Pablo
  • 28,133
  • 34
  • 125
  • 215
  • 9
    Note, group capture requires `sed` to turn on extended regular expressions with the `-E` flag. – peterh Jun 17 '20 at 13:15
  • 5
    Also note, `sed -E` is for Mac OS X and FreeBSD. If you are using a GNU distro (or in Git Bash or WSL), `sed -r` also works. If you're concerned about cross-platform compatibility, prefer `-E`. – mdhansen Aug 03 '21 at 20:09

12 Answers

463

The key to getting this to work is to tell sed to exclude what you don't want to be output as well as specifying what you do want. This technique depends on knowing how many matches you're looking for. The grep command below works for an unspecified number of matches.

string='This is a sample 123 text and some 987 numbers'
echo "$string" | sed -rn 's/[^[:digit:]]*([[:digit:]]+)[^[:digit:]]+([[:digit:]]+)[^[:digit:]]*/\1 \2/p'

This says:

  • use extended regular expressions (-r)
  • don't default to printing each line (-n)
  • exclude zero or more non-digits
  • include one or more digits
  • exclude one or more non-digits
  • include one or more digits
  • exclude zero or more non-digits
  • print the substitution (p) (on one line)
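A small sketch of how the expression depends on the number of matches. It uses `-E`, which the comments above describe as the portable spelling of `-r`; the extra sample lines and the `extract_two` helper name are made up for illustration:

```shell
# Hypothetical helper wrapping the two-number expression from above.
extract_two() {
  sed -En 's/[^[:digit:]]*([[:digit:]]+)[^[:digit:]]+([[:digit:]]+)[^[:digit:]]*/\1 \2/p'
}

# Exactly two runs of digits: both are captured.
two=$(echo 'This is a sample 123 text and some 987 numbers' | extract_two)

# A single run: the pattern cannot match, so nothing is printed.
one=$(echo 'only 123 here' | extract_two)

# Three runs: the first two are substituted, the third is left in place.
three=$(echo '10 a 20 b 30' | extract_two)

printf 'two:   [%s]\n' "$two"
printf 'one:   [%s]\n' "$one"
printf 'three: [%s]\n' "$three"
```

With one run of digits nothing is printed, and with three runs the third stays attached to the substituted text, which is why the expression has to be adapted to the expected number of matches.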

In general, in sed you capture groups using parentheses and output what you capture using a back reference:

echo "foobarbaz" | sed 's/^foo\(.*\)baz$/\1/'

will output "bar". If you use -r (-E for OS X) for extended regex, you don't need to escape the parentheses:

echo "foobarbaz" | sed -r 's/^foo(.*)baz$/\1/'

There can be up to 9 capture groups and their back references. The back references are numbered in the order the groups appear, but they can be used in any order and can be repeated:

echo "foobarbaz" | sed -r 's/^foo(.*)b(.)z$/\2 \1 \2/'

outputs "a bar a".

If you have GNU grep:

echo "$string" | grep -Po '\d+'

It may also work in BSD, including OS X:

echo "$string" | grep -Eo '\d+'

These commands will match any number of digit sequences. The output will be on multiple lines.

Variations such as the following also work:

echo "$string" | grep -Po '(?<=\D )(\d+)'

The -P option enables Perl Compatible Regular Expressions. See man 3 pcrepattern or man 3 pcresyntax.
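If the matches are wanted on a single line, as in the sed output, the grep output can be joined afterwards. A minimal sketch, assuming a `paste` that supports `-s` and `-d`; the variable names are only illustrative:

```shell
string='This is a sample 123 text and some 987 numbers'

# grep -o prints each match on its own line. -E with a POSIX class is used
# here so the example does not depend on \d or on PCRE support.
matches=$(printf '%s\n' "$string" | grep -Eo '[[:digit:]]+')

# paste -s joins the lines; -d ' ' uses a space as the separator.
joined=$(printf '%s\n' "$matches" | paste -sd ' ' -)

printf '%s\n' "$joined"
```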

Daniel Griscom
  • 1,834
  • 2
  • 26
  • 50
Dennis Williamson
  • 346,391
  • 90
  • 374
  • 439
  • 33
    As a note, OSX Mountain Lion no longer supports PCRE in grep. – yincrash Aug 09 '12 at 15:20
  • 1
    As a side-note, grep -o option is not supported on Solaris 9. Also, Solaris 9 does not support the sed -r option. :( – Daniel Kats Oct 23 '12 at 15:42
  • 7
    Ask your sysadmin to install gsed. You'd be amazed at what a few donuts will get you... – avgvstvs Dec 11 '12 at 13:08
  • 1
    On OSX (including Mountain Lion) you can use `brew` to install `grep` from [homebrew-dupes](https://github.com/Homebrew/homebrew-dupes) and then use the (rather useful) `-P` option (: – drevicko Sep 02 '13 at 23:44
  • 3
    Note that you might need to prefix the '(' and ')' with '\', I don't know why. – lumbric May 09 '14 at 09:40
  • 9
    @lumbric: If you're referring to the `sed` example, if you use the `-r` option (or `-E` for OS X, IIRC) you don't need to escape the parentheses. The difference is that between basic regular expressions and extended regular expressions (`-r`). – Dennis Williamson May 09 '14 at 10:51
  • 2
    I found the accepted answer confusing b/c it incorporated a large regexp with the example, making it hard to extract the needed information: In sed you must escape parentheses `\(.*\)`, access capture groups with `\1`, `\2`, etc. – Noah Huppert May 20 '18 at 16:58
  • @NoahHuppert: You don't need to escape the parentheses if you use extended regex, as I have in my example, by including the `-r` option. I agree that I can highlight the capturing in my answer. I'll edit it accordingly. The reason the regex is large is because it implements the functionality that the OP was looking for in the Perl-style expression `\d` and the given input string. – Dennis Williamson May 21 '18 at 16:34
  • This works fine for me _**without**_ the `-n` and the `/p` — because you are substituting the whole string and outputting only `\1 \2` — so "exclude what you don't want" really isn't _the **key**_ – Stephen P Feb 27 '20 at 22:59
  • `sed` outputs the input instead of the empty string when there is no match, while `grep` works as expected. – Ziyuan Sep 20 '21 at 12:00
  • This answer relies on knowing how many numbers will occur on the line. The question specifically asks about getting this done with the capture group `([\d]+)`, meaning specifying a cluster of numbers and printing however many matches there may be on that line. – Myridium Jan 14 '22 at 03:51
  • @Myridium: I edited my answer to address that. – Dennis Williamson Jun 24 '22 at 15:55
  • `sed -E` also works on Linux. The manual says "for portability use POSIX -E". Maybe `-r` can be entirely replaced in this answer with `-E`. – Matthias Braun Jul 15 '23 at 19:51
57

Sed has up to nine remembered patterns but you need to use escaped parentheses to remember portions of the regular expression.

See here for examples and more detail
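As an illustration of escaped parentheses in basic regular expressions, here is a minimal sketch; the key=value input is invented for the example:

```shell
# Hypothetical input: key=value lines, of which only "version" is wanted.
input='name=demo
version=1.2.3
license=MIT'

# Basic regular expressions: \( \) mark the group, \1 refers back to it.
# -n plus the p flag prints only lines where the substitution happened.
version=$(printf '%s\n' "$input" | sed -n 's/^version=\(.*\)$/\1/p')

printf '%s\n' "$version"
```

Anchoring the pattern with `^` and `$` makes the substitution replace the whole line, so only the captured part remains.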

Peter McG
  • 18,857
  • 8
  • 45
  • 53
  • 66
    `sed -e 's/version=\(.+\)/\1/' input.txt` this will still output the whole input.txt – Pablo May 06 '10 at 00:28
  • 3
    @Pablo, In your pattern you have to write `\+` instead of `+`. And I don't understand why people use `-e` for just one sed command. – Fredrick Gauss Nov 10 '17 at 12:23
  • 1
    use `sed -n -e 's/version=\(.\+\)/\1/p' input.txt` see: https://www.mikeplate.com/2012/05/09/extract-regular-expression-group-match-using-grep-or-sed/ – awattar Apr 10 '18 at 09:43
  • 5
    I'd suggest using `sed -E` to use the so-called "modern" or "extended" regular expressions that look a lot closer to Perl/Java/JavaScript/Go/whatever flavors. (Compare to `grep -E` or `egrep`.) The default syntax has those strange escaping rules and is considered "obsolete". For more info on the differences between the two, run `man 7 re_format`. – AndrewF Nov 28 '18 at 03:51
39

You can use grep:

grep -Eow "[0-9]+" file
ghostdog74
  • 327,991
  • 56
  • 259
  • 343
  • 5
    @ghostdog74: Absolutely agree with you. How can I get grep to output only captured groups? – Pablo May 06 '10 at 01:24
  • 2
    @Michael - that's why the `o` option is there - http://unixhelp.ed.ac.uk/CGI/man-cgi?grep : -o, --only-matching Show only the part of a matching line that matches PATTERN – Bert F May 06 '10 at 11:36
  • 21
    @Bert F: I understand the matching part, but it's not capturing group. What I want is to have like this ([0-9]+).+([abc]{2,3}) so there are 2 capturing groups. I want to output ONLY capturing groups by backreferences or somehow else. – Pablo May 06 '10 at 12:11
  • Hello Michael. Did you manage to extract the nth captured group with grep? – doc_id Mar 14 '11 at 08:30
  • 2
    @Pablo: grep's only outputting what matches. To give it multiple groups, use multiple expressions: `grep -Eow -e "[0-9]+" -e "[abc]{2,3}"` I don't know how you could require those two expressions to be on one line aside from piping from a previous grep (which could still not work if either pattern matches more than once on a line). – idbrii Oct 03 '12 at 17:56
  • Also, you can't do `echo "a 10 b 12" | grep -Eo "a ([0-9]+)"` and get just the "10". But this works: `echo "a 10 b 12" | grep -Eo "a ([0-9]+)" | sed 's/a //'` – abalter May 15 '17 at 19:36
20

run(s) of digits

This answer works with any count of digit groups. Example:

$ echo 'Num123that456are7899900contained0018166intext' \
   | sed -En 's/[^0-9]*([0-9]{1,})[^0-9]*/\1 /gp'

123 456 7899900 0018166

Expanded answer.

Is there any way to tell sed to output only captured groups?

Yes. Replace all of the text with the capture group:

$ echo 'Number 123 inside text' \
   | sed 's/[^0-9]*\([0-9]\{1,\}\)[^0-9]*/\1/'

123

s/[^0-9]*                           # several non-digits
         \([0-9]\{1,\}\)            # followed by one or more digits
                        [^0-9]*     # and followed by more non-digits.
                               /\1/ # gets replaced only by the digits.

Or with extended syntax (fewer backslashes, and it allows the use of +):

$ echo 'Number 123 in text' \
   | sed -E 's/[^0-9]*([0-9]+)[^0-9]*/\1/'

123

To avoid printing the original text when there is no number, use:

$ echo 'Number xxx in text' \
   | sed -En 's/[^0-9]*([0-9]+)[^0-9]*/\1/p'
  • (-n) Do not print the input by default.
  • (/p) print only if a replacement was done.
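The effect of the two options can be seen side by side in this small sketch (the `pattern` variable is just a convenience for the example):

```shell
pattern='s/[^0-9]*([0-9]+)[^0-9]*/\1/'

# Without -n: a line with no digits fails to match and is printed unchanged.
plain=$(echo 'Number xxx in text' | sed -E "$pattern")

# With -n and the p flag: the same line produces no output at all.
quiet=$(echo 'Number xxx in text' | sed -En "${pattern}p")

# A line that does match is still printed, reduced to the captured digits.
hit=$(echo 'Number 123 in text' | sed -En "${pattern}p")

printf 'plain: [%s]\nquiet: [%s]\nhit:   [%s]\n' "$plain" "$quiet" "$hit"
```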

And to match several numbers (and also print them):

$ echo 'N 123 in 456 text' \
  | sed -En 's/[^0-9]*([0-9]+)[^0-9]*/\1 /gp'

123 456

That works for any count of digit runs:

$ str='Test Num(s) 123 456 7899900 contained as0018166df in text'
$ echo "$str" \
   | sed -En 's/[^0-9]*([0-9]{1,})[^0-9]*/\1 /gp'

123 456 7899900 0018166

Which is very similar to the grep command:

$ str='Test Num(s) 123 456 7899900 contained as0018166df in text'
$ echo "$str" | grep -Po '\d+'
123
456
7899900
0018166

About \d

and pattern: /([\d]+)/

Sed does not recognize the '\d' shorthand. The ASCII range [0-9] used above is not exactly equivalent to it. The alternative is to use a POSIX character class: '[[:digit:]]'.

The selected answer uses such "character classes" to build a solution:

$ str='This is a sample 123 text and some 987 numbers'
$ echo "$str" | sed -rn 's/[^[:digit:]]*([[:digit:]]+)[^[:digit:]]+([[:digit:]]+)[^[:digit:]]*/\1 \2/p'

That solution only works for (exactly) two runs of digits.

Of course, as the answer is being executed inside the shell, we can define a couple of variables to make that answer shorter:

$ str='This is a sample 123 text and some 987 numbers'
$ d=[[:digit:]]     D=[^[:digit:]]
$ echo "$str" | sed -rn "s/$D*($d+)$D+($d+)$D*/\1 \2/p"

But, as has been already explained, using a s/…/…/gp command is better:

$ str='This is 75577 a sam33ple 123 text and some 987 numbers'
$ d=[[:digit:]]     D=[^[:digit:]]
$ echo "$str" | sed -rn "s/$D*($d+)$D*/\1 /gp"
75577 33 123 987

That will cover both repeated runs of digits and writing a short(er) command.

  • Surprised after reading the high voted accepted answer, I scrolled down to write about its narrow scope and to actually address the spirit of the question. I should have guessed that someone would have done it years ago already. This is very well explained and is the true correct answer. – Amit Naidu May 19 '19 at 06:39
  • This is a little hacky and doesn't generalise well. The problem with this approach is that the pattern `[^0-9]*([0-9]+)[^0-9]*` needs to be designed in such a way that it never crosses the boundary of another match. That works OK for this example, but for complex search queries that don't work on a character-by-character basis, it isn't very practical to have to surround the actual desired match group `(whatever)` with its forward-lookup and reverse-lookup negation. – Myridium Jan 14 '22 at 04:16
  • It also needs to capture *everything* that is not part of the capture groups. – Myridium Jan 14 '22 at 04:29
13

Give up and use Perl

Since sed does not cut it, let's just throw in the towel and use Perl; at least it is in the LSB, while grep's GNU extensions are not :-)

  • Print the entire matching part, no matching groups or lookbehind needed:

    cat <<EOS | perl -lane 'print m/\d+/g'
    a1 b2
    a34 b56
    EOS
    

    Output:

    12
    3456
    
  • Single match per line, often structured data fields:

    cat <<EOS | perl -lape 's/.*?a(\d+).*/$1/g'
    a1 b2
    a34 b56
    EOS
    

    Output:

    1
    34
    

    With lookbehind:

    cat <<EOS | perl -lane 'print m/(?<=a)(\d+)/'
    a1 b2
    a34 b56
    EOS
    
  • Multiple fields:

    cat <<EOS | perl -lape 's/.*?a(\d+).*?b(\d+).*/$1 $2/g'
    a1 c0 b2 c0
    a34 c0 b56 c0
    EOS
    

    Output:

    1 2
    34 56
    
  • Multiple matches per line, often unstructured data:

    cat <<EOS | perl -lape 's/.*?a(\d+)|.*/$1 /g'
    a1 b2
    a34 b56 a78 b90
    EOS
    

    Output:

    1 
    34 78
    

    With lookbehind:

    cat <<EOS | perl -lane 'print m/(?<=a)(\d+)/g'
    a1 b2
    a34 b56 a78 b90
    EOS
    

    Output:

    1
    3478
    
Ciro Santilli OurBigBook.com
  • 347,512
  • 102
  • 1,199
  • 985
10

I believe the pattern given in the question was by way of example only, and the goal was to match any pattern.

If you have a sed with the GNU extension allowing insertion of a newline in the pattern space, one suggestion is:

> set string = "This is a sample 123 text and some 987 numbers"
>
> set pattern = "[0-9][0-9]*"
> echo $string | sed "s/$pattern/\n&\n/g" | sed -n "/$pattern/p"
123
987
> set pattern = "[a-z][a-z]*"
> echo $string | sed "s/$pattern/\n&\n/g" | sed -n "/$pattern/p"
his
is
a
sample
text
and
some
numbers

These examples are with tcsh (yes, I know it's the wrong shell) with CYGWIN. (Edit: For bash, remove set, and the spaces around =.)
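For reference, a sketch of the same commands in POSIX sh or bash syntax. It still assumes GNU sed, since `\n` in the replacement is the extension this answer relies on:

```shell
string='This is a sample 123 text and some 987 numbers'
pattern='[0-9][0-9]*'

# First sed: put every match on a line of its own (GNU sed: \n is a newline).
# Second sed: keep only the lines that contain a match.
result=$(printf '%s\n' "$string" | sed "s/$pattern/\n&\n/g" | sed -n "/$pattern/p")

printf '%s\n' "$result"
```

The pattern is interpolated into both sed scripts, so it must not contain an unescaped `/`.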

Joseph Quinsey
  • 9,553
  • 10
  • 54
  • 77
  • @Joseph: thanks, however, based on my task I feel like grep is more natural, like ghostdog74 suggested. Just need to figure out how to make grep output the capture groups only, not the whole match. – Pablo May 06 '10 at 05:59
  • 2
    Just a note, but the plus sign '+' means 'one or more' which would remove the need for repeating yourself in the patterns. So, "[0-9][0-9]*" would become "[0-9]+" – RandomInsano Apr 12 '12 at 17:31
  • 5
    @RandomInsano: In order to use the `+`, you would need to escape it or use the `-r` option (`-E` for OS X). You can also use `\{1,\}` (or `-r` or `-E` without the escaping). – Dennis Williamson Apr 18 '12 at 22:02
5

You need to match the whole line in order to print only the group, which you're doing in the second command, but you don't need to group the first wildcard. This will work as well:

echo "/home/me/myfile-99" | sed -r 's/.*myfile-(.*)$/\1/'
eskogh
  • 51
  • 1
  • 1
4

Try

sed -n -e "/[0-9]/s/^[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\).*$/\1 \2 \3 \4 \5 \6 \7 \8 \9/p"

I got this under cygwin:

$ (echo "asdf"; \
   echo "1234"; \
   echo "asdf1234adsf1234asdf"; \
   echo "1m2m3m4m5m6m7m8m9m0m1m2m3m4m5m6m7m8m9") | \
  sed -n -e "/[0-9]/s/^[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\).*$/\1 \2 \3 \4 \5 \6 \7 \8 \9/p"

1234
1234 1234
1 2 3 4 5 6 7 8 9
$
Bert F
  • 85,407
  • 12
  • 106
  • 123
3

It's not what the OP asked for (capturing groups) but you can extract the numbers using:

S='This is a sample 123 text and some 987 numbers'
echo "$S" | sed 's/ /\n/g' | sed -r '/([0-9]+)/ !d'

Gives the following:

123
987
Thomas Bratt
  • 48,038
  • 36
  • 121
  • 139
2

I want to give a simpler example on "output only captured groups with sed"

I have /home/me/myfile-99 and wish to output the serial number of the file: 99

My first try, which didn't work, was:

echo "/home/me/myfile-99" | sed -r 's/myfile-(.*)$/\1/'
# output: /home/me/99

To make this work, we need to capture the unwanted portion in a capture group as well:

echo "/home/me/myfile-99" | sed -r 's/^(.*)myfile-(.*)$/\2/'
# output: 99

*) Note that sed doesn't have \d

Sida Zhou
  • 3,529
  • 2
  • 33
  • 48
0

You can use ripgrep, which also seems to be a sed replacement for simple substitutions, like this

rg '(\d+)' -or '$1'

Here ripgrep uses -o (--only-matching) and -r (--replace) to output only the first capture group, $1 (quoted to avoid interpretation as a variable by the shell). It is printed twice because there are two matches.

Patrick Häcker
  • 451
  • 4
  • 3
0

Rename all files called lesson${two_digits}.mp4 to lesson0${two_digits}.mp4

ls -d -- lesson[0-9][0-9].mp4 | sed 's/lesson\([0-9][0-9]\)\.mp4/mv & lesson0\1.mp4/' | sh

For example, the files lesson11.mp4 and lesson50.mp4 would be renamed to lesson011.mp4 and lesson050.mp4.
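A dry-run sketch of the same idea: generate the `mv` commands and inspect them before piping anything to a shell. The `gen_renames` helper name is hypothetical, and the file names are fed in with `printf` instead of `ls` so the example does not touch any files:

```shell
# Hypothetical helper: turn each lessonNN.mp4 name into an mv command.
# & is the whole matched name; \1 is the captured two-digit number.
gen_renames() {
  sed 's/^lesson\([0-9][0-9]\)\.mp4$/mv -- & lesson0\1.mp4/'
}

cmds=$(printf '%s\n' lesson11.mp4 lesson50.mp4 | gen_renames)

# Print the commands only; append "| sh" once they look right.
printf '%s\n' "$cmds"
```

Using a single group and writing `lesson0\1` avoids having a digit directly after a back reference.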

Jon Zuka
  • 11
  • 2