122

My example string is as follows:

This is 02G05 a test string 20-Jul-2012

From the above string I want to extract 02G05. For that I tried the following regex with sed:

$ echo "This is 02G05 a test string 20-Jul-2012" | sed -n '/\d+G\d+/p'

But the above command prints nothing, I believe because the pattern I supplied to sed does not match anything.

So my question is: what am I doing wrong here, and how do I correct it?

When I try the same string and pattern with Python, I get the result I expect:

>>> import re
>>> st = "This is 02G05 a test string 20-Jul-2012"
>>> re.findall(r'\d+G\d+', st)
['02G05']
RanRag

6 Answers

134

How about using grep -E?

echo "This is 02G05 a test string 20-Jul-2012" | grep -Eo '[0-9]+G[0-9]+'
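A minimal sketch of what `-o` does when a line holds more than one match, assuming a `grep` that supports `-o` (GNU and BSD grep do; POSIX does not require it). The second input line is made up for illustration.

```shell
# Sample line from the question.
line='This is 02G05 a test string 20-Jul-2012'

# -E selects extended regexes, -o prints only the matched text.
match=$(printf '%s\n' "$line" | grep -Eo '[0-9]+G[0-9]+')
echo "$match"

# Hypothetical line with two matches: -o prints each match on its own line.
all=$(printf '%s\n' 'ids 02G05 and 11G22' | grep -Eo '[0-9]+G[0-9]+')
echo "$all"

# Keep only the first match.
first=$(printf '%s\n' "$all" | head -n 1)
echo "$first"
```

The date at the end of the sample line is not picked up because it contains no uppercase G between digits.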
mVChr
  • 4
    +1 This is simpler, and will also correctly handle the case of multiple matches on the same line. A complex `sed` script could be devised for that case, but why bother? – tripleee Jul 20 '12 at 07:28
  • `egrep` uses extended regexps; `sed` and `grep` use basic regexps by default; `egrep` or `grep -e` or `sed -E` use extended regexps; and the Python code in the question uses a Perl-style dialect similar to PCRE (Perl Compatible Regular Expressions). GNU grep can use PCRE with the `-P` option. – Felipe Buccioni Aug 22 '16 at 13:46
  • @FelipeBuccioni actually that should be `egrep` or `grep -E` or `sed -r` – SensorSmith Apr 13 '18 at 15:44
  • For a single(first) match, append ` | head -1` (without backticks), as per [this answer](https://stackoverflow.com/a/14093511/3610458) to another question. – SensorSmith Apr 13 '18 at 15:55
  • @SensorSmith Some `sed` implementations use `-r`, others use `-E`; still others don't have an option to change the regex dialect. – tripleee Apr 20 '18 at 03:41
  • 2
    `grep` has `-m 1` to stop after the first match. – tripleee Apr 20 '18 at 03:42
  • Thanks a ton. Finally a simple and elegant solution than `grep` / `awk` / `sed` – Sunny Tambi May 29 '20 at 11:40
  • This doesn't handle multiple lines though. – MattSt Jan 19 '23 at 08:51
128

The pattern \d might not be supported by your sed. Try [0-9] or [[:digit:]] instead.

To only print the actual match (not the entire matching line), use a substitution.

sed -n 's/.*\([0-9][0-9]*G[0-9][0-9]*\).*/\1/p'
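One thing to watch with the substitution above: the leading `.*` is greedy, so it can swallow all but one of the digits in front of the G. The sketch below shows that effect on the sample line, plus one possible way around it. It assumes a `sed` that accepts `-E` for extended regexes (GNU and BSD sed do); the variable names are only for illustration.

```shell
line='This is 02G05 a test string 20-Jul-2012'

# The leading .* is greedy: it consumes as much as it can, leaving only
# the single digit the group needs before "G".
greedy=$(printf '%s\n' "$line" | sed -n 's/.*\([0-9][0-9]*G[0-9][0-9]*\).*/\1/p')
echo "$greedy"

# Requiring a non-digit (or the start of the line) right before the
# digits keeps the leading .* from eating any of them.
full=$(printf '%s\n' "$line" | sed -nE 's/(^|.*[^0-9])([0-9]+G[0-9]+).*/\2/p')
echo "$full"
```

The first command prints `2G05`, the second `02G05`.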
tripleee
  • 6
    Thanks it worked fine. But I have a question why `.*` is necessary with your regex because when I try `sed -n 's/\([0-9]\+G[0-9]\+\)/\1/p'` it just prints the entire line. – RanRag Jul 19 '12 at 20:47
  • 7
    That's why, isn't it? Replace whatever comes before and after the match with nothing, then print the whole line. – tripleee Jul 19 '12 at 21:01
  • 1
    @tripleee This only prints `2G05` not `02G05`. The expression that works is `'s/.*\([0-9][0-9]G[0-9][0-9]*\).*/\1/p'` – Kshitiz Sharma Dec 12 '13 at 10:06
  • 1
    That hard-codes it to exactly two digits. Something like `sed -n 's/\(.*[^0-9]\)\?\([0-9][0-9]*G[0-9][0-9]*\).*/\2/p'` would be more general. (I assume your `sed` supports `\?` for zero or one occurrence.) – tripleee Dec 12 '13 at 11:53
  • See also https://stackoverflow.com/a/48898886/874188 for how to replace various other common Perl escapes like `\w`, `\s`, etc. – tripleee Aug 16 '19 at 05:28
  • @tripleee your "to only print the actual match...." was the pointer i needed for what i was trying to do – northern-bradley Mar 06 '20 at 20:29
  • @tripleee what do you want to show with `sed -n 's/\(.*[^0-9]\)\?\([0-9][0-9]*G[0-9][0-9]*\).*/\2/p'`? This is confusing, `\1` is not used. – Timo May 27 '20 at 11:53
  • Why is it confusing? I discard whatever the first group matches. The `\?` makes it optional (so it could be empty) but if there is anything before the number, we remove it. – tripleee May 27 '20 at 12:03
8

Try this instead:

echo "This is 02G05 a test string 20-Jul-2012" | sed 's/.* \([0-9]\+G[0-9]\+\) .*/\1/'

But note: if there are two matches on one line, it will print the second one.
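A small sketch of that caveat, using a made-up input line with two space-delimited candidates. The pattern is spelled with `[0-9][0-9]*` instead of `[0-9]\+` so that it does not depend on the GNU `\+` extension; otherwise it is the same substitution.

```shell
# Hypothetical input with two candidates, each surrounded by spaces.
line='first 02G05 then 11G22 end'

# The greedy leading .* runs as far right as it can, so the last
# candidate on the line is the one captured.
last=$(printf '%s\n' "$line" | sed 's/.* \([0-9][0-9]*G[0-9][0-9]*\) .*/\1/')
echo "$last"

# Without -n and the p flag, a line that does not match is printed unchanged.
unchanged=$(printf '%s\n' 'no match here' | sed 's/.* \([0-9][0-9]*G[0-9][0-9]*\) .*/\1/')
echo "$unchanged"
```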

Zsolt Botykai
6

sed doesn't recognize \d, use [[:digit:]] instead. You will also need to escape the + or use the -r switch (-E on OS X).

Note that [0-9] works as well for Hindu-Arabic numerals (the digits 0 through 9).
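As a rough illustration of the two quantifier spellings: in basic regexes the portable way to say "one or more" is the interval `\{1,\}`, while with `-E` a plain `+` works. This assumes a `sed` that accepts `-E` (use `-r` on older GNU versions). Both commands only select the line, so they print it in full.

```shell
line='This is 02G05 a test string 20-Jul-2012'

# Basic regex (the default): no "+", so use the interval \{1,\}.
bre=$(printf '%s\n' "$line" | sed -n '/[[:digit:]]\{1,\}G[[:digit:]]\{1,\}/p')

# Extended regex: "+" works unescaped.
ere=$(printf '%s\n' "$line" | sed -nE '/[[:digit:]]+G[[:digit:]]+/p')

echo "$bre"
echo "$ere"
```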

Dennis Williamson
  • I tried `sed -n '/[0-9]\+G[0-9]\+/p'`. Now it just prints the whole string – RanRag Jul 19 '12 at 20:43
  • @Noob: You will need to use substitution to [exclude the parts you don't want to print](http://stackoverflow.com/questions/2777579/sed-group-capturing/2778096#2778096). – Dennis Williamson Jul 19 '12 at 20:46
0

Try using rextract. It will let you extract text using a regular expression and reformat it.

Example:

$ echo "This is 02G05 a test string 20-Jul-2012" | ./rextract '([\d]+G[\d]+)' '${1}'

2G05
Geoff
0

We can use sed -En to simplify the regular expression, where:

-n: suppress automatic printing of pattern space
-E: use extended regular expressions in the script

$ echo "This is 02G05 a test string 20-Jul-2012" | sed -En 's/.*([0-9][0-9]+G[0-9]+).*/\1/p'

02G05
aotherix