How to extract text from a string using sed?

Question

My example string is as follows:

This is 02G05 a test string 20-Jul-2012

From the above string I want to extract 02G05. For that I tried the following regex with sed:

$ echo "This is 02G05 a test string 20-Jul-2012" | sed -n '/\d+G\d+/p'

But the above command prints nothing, and I believe the reason is that the pattern I supplied to sed does not match anything.

So my question is: what am I doing wrong here, and how do I correct it?

When I try the same string and pattern in Python, I get the result I expect:

>>> import re
>>> st = "This is 02G05 a test string 20-Jul-2012"
>>> re.findall(r'\d+G\d+', st)
['02G05']

Python is definitely not `sed`. Their regex flavors are quite different. — tripleee, Dec 12 '13 at 11:45

mVChr · Answer 1 · 2020-05-29T17:21:30.897

134

How about using grep -E?

echo "This is 02G05 a test string 20-Jul-2012" | grep -Eo '[0-9]+G[0-9]+'
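
A small sketch of how this behaves when a line holds more than one match. The extended sample line (with an invented second token, `7G12`) and the variable names are assumptions for illustration; `-o` is available in GNU and BSD grep but is not required by POSIX.

```shell
# Hypothetical input with two tokens on one line (the second one is invented).
line='This is 02G05 a test string 20-Jul-2012 and 7G12'

# -E: extended regex, -o: print each match on its own line.
all=$(printf '%s\n' "$line" | grep -Eo '[0-9]+G[0-9]+')

# Keep only the first match by trimming the output.
first=$(printf '%s\n' "$line" | grep -Eo '[0-9]+G[0-9]+' | head -n 1)

printf 'all matches:\n%s\n' "$all"
printf 'first match: %s\n' "$first"
```

As far as I can tell, `grep -m 1` stops after the first matching line rather than the first match, so combined with `-o` it would still print every token on that line; `head -n 1` is what limits the output to a single token.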

edited May 29 '20 at 17:21

answered Jul 19 '12 at 20:42

mVChr

+1 This is simpler, and will also correctly handle the case of multiple matches on the same line. A complex `sed` script could be devised for that case, but why bother? – tripleee Jul 20 '12 at 07:28
`sed` and `grep` use basic regexps by default; `egrep` or `grep -e` or `sed -E` use extended regexps; and the Python code in the question uses Perl-style regular expressions (PCRE, Perl Compatible Regular Expressions). GNU grep can use PCRE with the `-P` option. – Felipe Buccioni Aug 22 '16 at 13:46
@FelipeBuccioni actually that should be `egrep` or `grep -E` or `sed -r` – SensorSmith Apr 13 '18 at 15:44
For a single(first) match, append ` | head -1` (without backticks), as per [this answer](https://stackoverflow.com/a/14093511/3610458) to another question. – SensorSmith Apr 13 '18 at 15:55
@SensorSmith Some `sed` implementations use `-r`, others use `-E`; still others don't have an option to change the regex dialect. – tripleee Apr 20 '18 at 03:41
`grep` has `-m 1` to stop after the first match. – tripleee Apr 20 '18 at 03:42
Thanks a ton. Finally a simple and elegant solution than `grep` / `awk` / `sed` – Sunny Tambi May 29 '20 at 11:40
This doesn't handle multiple lines though. – MattSt Jan 19 '23 at 08:51

score 128 · Accepted Answer · answered Jul 19 '12 at 20:39

The pattern \d might not be supported by your sed. Try [0-9] or [[:digit:]] instead.

To only print the actual match (not the entire matching line), use a substitution.

sed -n 's/.*\([0-9][0-9]*G[0-9][0-9]*\).*/\1/p'
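
One caveat worth checking on your own input: the leading `.*` is greedy, so it can swallow the leading digits of the token. The sketch below shows that effect, plus a variant that requires a non-digit (or the start of the line) before the token. The variable names are illustrative, and `\{0,1\}` is used here as the POSIX basic-regex spelling of "optional".

```shell
line='This is 02G05 a test string 20-Jul-2012'

# Greedy .* leaves only the minimum the group needs: one digit before G.
greedy=$(printf '%s\n' "$line" |
  sed -n 's/.*\([0-9][0-9]*G[0-9][0-9]*\).*/\1/p')

# Optional prefix that must end in a non-digit, so the whole run of digits
# stays inside the second group.
anchored=$(printf '%s\n' "$line" |
  sed -n 's/\(.*[^0-9]\)\{0,1\}\([0-9][0-9]*G[0-9][0-9]*\).*/\2/p')

printf 'greedy:   %s\n' "$greedy"
printf 'anchored: %s\n' "$anchored"
```

With the sample line, the first command yields `2G05` and the second yields `02G05`.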

tripleee

Thanks it worked fine. But I have a question why `.*` is necessary with your regex because when I try `sed -n 's/\([0-9]\+G[0-9]\+\)/\1/p'` it just prints the entire line. – RanRag Jul 19 '12 at 20:47
That's why, isn't it? Replace whatever comes before and after the match with nothing, then print the whole line. – tripleee Jul 19 '12 at 21:01
@tripleee This only prints `2G05` not `02G05`. The expression that works is `'s/.*\([0-9][0-9]G[0-9][0-9]*\).*/\1/p'` – Kshitiz Sharma Dec 12 '13 at 10:06
That hard-codes it to exactly two digits. Something like `sed -n 's/\(.*[^0-9]\)\?\([0-9][0-9]*G[0-9][0-9]*\).*/\2/p'` would be more general. (I assume your `sed` supports `\?` for zero or one occurrence.) – tripleee Dec 12 '13 at 11:53
See also https://stackoverflow.com/a/48898886/874188 for how to replace various other common Perl escapes like `\w`, `\s`, etc. – tripleee Aug 16 '19 at 05:28
@tripleee your "to only print the actual match...." was the pointer i needed for what i was trying to do – northern-bradley Mar 06 '20 at 20:29
@tripleee what do you want to show with `sed -n 's/\(.*[^0-9]\)\?\([0-9][0-9]*G[0-9][0-9]*\).*/\2/p'`? This is confusing, `\1` is not used. – Timo May 27 '20 at 11:53
Why is it confusing? I discard whatever the first group matches. The `\?` makes it optional (so it could be empty) but if there is anything before the number, we remove it. – tripleee May 27 '20 at 12:03

score 8 · Answer 3 · answered Jul 19 '12 at 20:40

Try this instead:

echo "This is 02G05 a test string 20-Jul-2012" | sed 's/.* \([0-9]\+G[0-9]\+\) .*/\1/'

But note: if there are two matches on one line, it will print the second one.
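
A short sketch of that last-match behavior, plus one thing the command above does not guard against: without `-n` and the `p` flag, a line with no match is printed unchanged. The input strings are invented, and `[0-9][0-9]*` stands in for the GNU-specific `[0-9]\+` so the example does not depend on that extension.

```shell
script='s/.* \([0-9][0-9]*G[0-9][0-9]*\) .*/\1/'

# Two tokens on one line: the greedy .* skips ahead to the last one.
last=$(printf '%s\n' 'first 1G2 then 33G44 end' | sed "$script")

# No token: without -n and the p flag, sed prints the line untouched.
passthrough=$(printf '%s\n' 'nothing to see here' | sed "$script")

# With -n and p, a non-matching line produces no output at all.
silent=$(printf '%s\n' 'nothing to see here' | sed -n "${script}p")

printf 'last: %s\npassthrough: %s\nsilent: [%s]\n' "$last" "$passthrough" "$silent"
```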

Zsolt Botykai

Or more generally the last one if there are multiple matches. – tripleee Jul 19 '16 at 13:28

score 6 · Answer 4 · answered Jul 19 '12 at 20:37

sed doesn't recognize \d, use [[:digit:]] instead. You will also need to escape the + or use the -r switch (-E on OS X).

Note that [0-9] works as well for Arabic-Hindu numerals.
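
As a rough sketch of the spellings of "one or more digits" mentioned here: the POSIX interval `\{1,\}` works in basic mode, while a bare `+` needs extended mode. The snippet assumes a sed that accepts `-E`; the variable names are illustrative.

```shell
line='This is 02G05 a test string 20-Jul-2012'

# Basic regex with a POSIX interval instead of + (no GNU extension needed).
bre=$(printf '%s\n' "$line" |
  sed -n '/[[:digit:]]\{1,\}G[[:digit:]]\{1,\}/p')

# Extended regex: + no longer needs a backslash.
ere=$(printf '%s\n' "$line" | sed -En '/[[:digit:]]+G[[:digit:]]+/p')

# Both address forms only select the line and print it whole;
# neither one extracts the token by itself.
printf '%s\n%s\n' "$bre" "$ere"
```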

Dennis Williamson

I tried `sed -n '/[0-9]\+G[0-9]\+/p'`. Now it just prints the whole string – RanRag Jul 19 '12 at 20:43
@Noob: You will need to use substitution to [exclude the parts you don't want to print](http://stackoverflow.com/questions/2777579/sed-group-capturing/2778096#2778096). – Dennis Williamson Jul 19 '12 at 20:46

score 0 · Answer 5 · edited Aug 22 '18 at 16:28

Try using rextract. It will let you extract text using a regular expression and reformat it.

Example:

$ echo "This is 02G05 a test string 20-Jul-2012" | ./rextract '([\d]+G[\d]+)' '${1}'

2G05

Geoff

answered Sep 13 '16 at 03:03

Tim Savannah

If this uses standard regex, the square brackets around `\d` are completely superfluous. – tripleee Nov 26 '19 at 06:16

score 0 · Answer 6 · answered Mar 17 '23 at 22:20

We can use sed -En to simplify the regular expression, where:

-n: suppress automatic printing of pattern space
-E: use extended regular expressions in the script

$ echo "This is 02G05 a test string 20-Jul-2012" | sed -En 's/.*([0-9][0-9]+G[0-9]+).*/\1/p'

02G05
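
One thing to be aware of: `[0-9][0-9]+` requires at least two digits before the `G`, which is what keeps the greedy `.*` from eating the leading `0` here. A token such as `7G12` (an invented example) would not match. A sketch of a variant without that restriction, borrowing the optional-prefix idea from the comments on the accepted answer; the variable names are hypothetical:

```shell
# Hypothetical inputs; the second one has a single digit before G.
two_digits='This is 02G05 a test string 20-Jul-2012'
one_digit='This is 7G12 a test string 20-Jul-2012'

strict='s/.*([0-9][0-9]+G[0-9]+).*/\1/p'
relaxed='s/(.*[^0-9])?([0-9]+G[0-9]+).*/\2/p'

a=$(printf '%s\n' "$two_digits" | sed -En "$strict")   # two digits: matches
b=$(printf '%s\n' "$one_digit"  | sed -En "$strict")   # one digit: no output
c=$(printf '%s\n' "$two_digits" | sed -En "$relaxed")  # prefix ends in a non-digit
d=$(printf '%s\n' "$one_digit"  | sed -En "$relaxed")  # single digit is accepted

printf 'strict:  [%s] [%s]\nrelaxed: [%s] [%s]\n' "$a" "$b" "$c" "$d"
```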
