24

I tried to extract a number as shown below, but nothing is printed on screen:

echo "This is an example: 65 apples" | sed -n  's/.*\([0-9]*\) apples/\1/p'

However, I get 65 if I match each of the two digits explicitly:

echo "This is an example: 65 apples" | sed -n  's/.*\([0-9][0-9]\) apples/\1/p'
65

How can I match a number when I don't know in advance how many digits it has, e.g. it could be 2344 instead of 65?

Uthman

6 Answers

29
$ echo "This is an example: 65 apples" | sed -r  's/^[^0-9]*([0-9]+).*/\1/'
65
codaddict
  • 5
    +1, but beware that not all sed support -r and thus cannot use the '+' modifier and must escape the parens. – William Pursell Feb 13 '12 at 12:51
  • 3
    Why doesn't a regex like `([0-9]*) apple` (http://sprunge.us/feGV) work in sed? It works just fine in Python. – shadyabhi Feb 13 '12 at 12:54
  • So ^[^0-9]* corresponds to all the non-digit characters at the start of the line, and [0-9]+ to one or more digits, right? – Uthman Feb 13 '12 at 12:55
  • 1
    @AbhijeetRastogi: Since we are using **substitution** we need to account for the entire line. Any part of the line not accounted for will be part of the output. This won't be the case if you are using pattern search (not substitution) as in your Python case. – codaddict Feb 13 '12 at 13:04
  • 1
    @codaddict Oops. My bad. Silly me. It's substitution. Thanks. – shadyabhi Feb 13 '12 at 13:25
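Following up on the portability comment above, here is a minimal sketch of the same idea using only POSIX basic regular expressions, so no -r is needed. The variable names `line` and `first_number` are just for illustration; `[0-9][0-9]*` stands in for `[0-9]+`, and the parentheses are escaped.

```shell
line="This is an example: 65 apples"

# POSIX BRE: \( \) for the group, [0-9][0-9]* for "one or more digits".
# The pattern covers the whole line, so only the captured number is left.
first_number=$(printf '%s\n' "$line" | sed 's/^[^0-9]*\([0-9][0-9]*\).*/\1/')

echo "$first_number"   # prints 65
```

Because the pattern accounts for the entire line, nothing outside the capture group survives the substitution, which is the point made in the comments about substitution versus searching.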
6

It's because your first .* is greedy, and your [0-9]* allows 0 or more digits. Hence the .* gobbles up as much as it can (including the digits) and the [0-9]* matches nothing.
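To make this visible, the sketch below captures both parts and prints each in brackets (the variable names `line` and `greedy` are only for illustration). The first group ends up holding the digits, and the second group is empty:

```shell
line="This is an example: 65 apples"

# Group 1 is the greedy .*, group 2 is the [0-9]* from the question.
greedy=$(printf '%s\n' "$line" | sed -n 's/\(.*\)\([0-9]*\) apples/[\1][\2]/p')

echo "$greedy"   # prints [This is an example: 65][]
```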

You can do:

echo "This is an example: 65 apples" | sed -n  's/.*\b\([0-9]\+\) apples/\1/p'

Here \+ forces [0-9] to match at least one digit, and the \b word boundary before the digits stops .* from swallowing the leading digits, so the whole number is captured. Note that \+ and \b are GNU sed extensions.

However, it's easier to use grep, where you match just the number:

echo "This is an example: 65 apples" | grep -P -o '[0-9]+(?= +apples)'

The -P means "perl regex" (so I don't have to worry about escaping the '+').

The -o means "only print the matches".

The (?= +apples) is a lookahead: the digits must be followed by one or more spaces and the word apples, but that trailing text is not included in the match.
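Not every grep build supports -P. As a hedged alternative, the same result can be sketched with two -E stages: the first keeps the number together with the word apples, and the second trims the match down to the leading digits. The variable names `line` and `count` are assumptions for the example, and -o itself is a common extension rather than part of POSIX.

```shell
line="This is an example: 65 apples"

# Stage 1 yields "65 apples"; stage 2 keeps only the digits at its start.
count=$(printf '%s\n' "$line" | grep -oE '[0-9]+ +apples' | grep -oE '^[0-9]+')

echo "$count"   # prints 65
```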

mathematical.coffee
3

A simple way to extract all numbers from a string:

echo "1213 test 456 test 789" | grep -P -o "\d+"

And the result:

1213
456
789
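Since each number lands on its own line, the output is easy to consume from a script. As a small usage sketch (the variable `total` is made up for the example, and -E with [0-9]+ is used in place of -P with \d+), this sums the extracted numbers:

```shell
total=0

# Each match is printed on a separate line, so word splitting yields one number per iteration.
for n in $(echo "1213 test 456 test 789" | grep -oE '[0-9]+'); do
  total=$((total + n))
done

echo "$total"   # prints 2458
```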
Khate
3

What you are seeing is the greedy behavior of regex. In your first example, .* gobbles up all the digits. Something like this does it:

echo "This is an example: 65144 apples" | sed -n  's/[^0-9]*\([0-9]\+\) apples/\1/p'
65144

This way, you can't match any digits in the first bit. Some regex dialects have a way to ask for non-greedy matching, but I don't believe sed has one.
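As a rough sketch of how this handles an unknown number of digits, the same substitution can be wrapped in a small helper. The function name `extract_apples` is hypothetical, and `[0-9][0-9]*` is used instead of `\+` so that it does not depend on GNU extensions.

```shell
# Hypothetical helper: print the number that directly precedes " apples".
extract_apples() {
  printf '%s\n' "$1" | sed -n 's/[^0-9]*\([0-9][0-9]*\) apples/\1/p'
}

extract_apples "7 apples"                          # prints 7
extract_apples "This is an example: 65 apples"     # prints 65
extract_apples "This is an example: 2344 apples"   # prints 2344
```

One caveat: the pattern is not anchored, so if an earlier, unrelated number appears on the line, the text before the match is left in place in the output. Anchoring the pattern or accounting for the rest of the line avoids that.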

FatalError
0

The Rust tool ripgrep is now a nice alternative. It is fast, runs on Windows, Linux and macOS, and supports a Perl-like regex syntax (its default engine has no look-around or backreferences).

echo "This is an example: 65 apples" | rg '\d+' -o
65

The documentation for the -o option states:

-o, --only-matching Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.

David
0
echo "This is an example: 65 apples" | ssed -nR -e 's/.*?\b([0-9]*) apples/\1/p'

You will, however, need super-sed (ssed) for this to work. The -R option enables Perl-style regular expressions, which is what makes the non-greedy .*? available.

ctrl-alt-delor