Regex for the first standalone number

Question

Suppose I have a string "hello12 54 world23 43"

What I want is the first standalone number (having space before and after) and not the one attached to a word. So, for the above string it would be 54, and not 12.

I tried

sed -r 's|^([^.]+).*$|\1|; s|^[^0-9]*([0-9]+).*$|\1|'

but that gives me 12 (the first number in the string). Can anyone help?

Note: I can only use sed.

Try [`sed -E 's/^([0-9][^ ]|[^ ][0-9]|[^0-9])+ ([0-9]+) .*$/\2/'`](https://ideone.com/j7W2E8) — Wiktor Stribiżew, Sep 27 '16 at 11:11
Yes, in mine, actually, the approach is correct, but the alternation branches are not correct. Whoever comes up with a fix deserves an upvote or two :) — Wiktor Stribiżew, Sep 27 '16 at 11:34
That is a GNU sed option enabling extended POSIX syntax so as not to escape parentheses, alternation operator, plus quantifier... — Wiktor Stribiżew, Sep 27 '16 at 16:43
Shouldn't it be in [docs](https://www.gnu.org/software/sed/manual/sed.html)? @WiktorStribiżew — revo, Sep 27 '16 at 16:46
Well, it depends, there are various versions of sed around. Look [here](http://www.grymoire.com/Unix/Sed.html#uh-62k). — Wiktor Stribiżew, Sep 27 '16 at 17:12
I saw this page before I commented here. It's not a GNU sed option, however. It is only available in the `sed`s of those two operating systems, and I was surprised that you offered an option that may not work on other operating systems in the first place, so I was trying to find out more. Also, it's true that your regex has problems, but that couldn't be called a *bug*, as you referred to it. It's just the way it works. Your regex simply translates to `^.+ ([0-9]+) .*` on input strings that have numbers with more than 1 digit. @WiktorStribiżew — revo, Sep 27 '16 at 18:00
Happy new regex101.com! But I don't like it so much I prefer previous site simple flagging feature more. @WiktorStribiżew — revo, Sep 27 '16 at 18:00

hek2mgl · Answer 1 · 2016-09-27T13:09:35.250

3

Having GNU awk you can use the following command:

awk 'NF {print $1}' FPAT='\\y[[:digit:]]+\\y' file

The key here is the FPAT variable, which is a GNU extension. FPAT stands for field pattern and describes what a field is. In our case we want a field to be a number "enclosed" within word boundaries (\y; the backslash has to be doubled because awk processes escape sequences in command-line variable assignments).

NF {print $1} prints the first field only if the line contains at least one field, that is, if the number of fields (NF) is greater than zero. With FPAT set as above, the first field is the first standalone number.

Btw, probably your sed is able to do this?

echo "hello12 54 world23 43 " \
    | sed 's/\(\b\|^\)\([0-9]\{1,\}\)\(\b\|$\)/\n\2\n/' \
    | sed '/^[0-9]\{1,\}$/!d' \
    | sed '1!d'

Sorry, I can only guess as long as you can't tell us the exact sed version.

The first sed command puts the first standalone number on a line of its own. The second sed command deletes every line that does not consist solely of a number, and the last one deletes everything except the first line, if it exists.
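To see what each stage of the pipeline above contributes, the sketch below runs the first stage alone and then the whole pipeline wrapped in a function. It assumes GNU sed (\b, \| and \n in the replacement are GNU extensions), and the function names isolate_number and first_number are made up for illustration.

```shell
# Stage 1 alone: put the first standalone number on a line of its own.
# Assumes GNU sed: \b, \| and "\n" in the replacement are GNU extensions.
isolate_number() {
    sed 's/\(\b\|^\)\([0-9]\{1,\}\)\(\b\|$\)/\n\2\n/'
}

# Full pipeline: isolate, keep digit-only lines, keep the first of them.
first_number() {
    isolate_number | sed '/^[0-9]\{1,\}$/!d' | sed '1!d'
}

# Stage 1 turns the input line into three lines; the number is the second.
printf '%s\n' 'hello12 54 world23 43 ' | isolate_number

# The whole pipeline prints just the number.
printf '%s\n' 'hello12 54 world23 43 ' | first_number
```

The stages are separate sed processes on purpose: inside a single sed invocation the inserted newlines stay within one pattern space, so /^[0-9]\{1,\}$/ would be tested against the whole three-line buffer instead of the line holding the number.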

edited Sep 27 '16 at 13:09

answered Sep 27 '16 at 11:10

hek2mgl

I really wanted the answer in `sed` – Haris Sep 27 '16 at 11:30
Good and clean one! You can replace the `$1 != ""` with a simple `NF`, since you are checking if there is at least one field. – fedorqui Sep 27 '16 at 11:34
Not sure the `\|` operator is posix compatible. – SLePort Sep 27 '16 at 12:28
Didn't test it but it is the dirtiest solution I could have ever seen. Nice job – revo Sep 27 '16 at 12:57
@Kenavoz You are right, the alternation operator is defined only for extended POSIX regular expressions, while the `-r` flag of `sed` is not POSIX. sigh :) ... I'll keep it as it is for now, since I don't even know if OP's version of `sed` is POSIX compatible. – hek2mgl Sep 27 '16 at 13:08

score 1 · Answer 2 · edited May 23 '17 at 12:00

1

It is probably cleaner to use grep with \b:

$ grep -Eo '\b[0-9]+\b' <<< "hello12 54 world23 43"
54
43

Note this shows all the matches, so you may want to pipe to head -1 to get just the first one.
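As a minimal sketch of that suggestion, the function below keeps only the first match. The name first_standalone_grep is made up for illustration, and the -o option and \b are assumed to be available (they are GNU/BSD extensions, not POSIX). The input is piped in with printf, so the example does not rely on the shell supporting <<<.

```shell
# Print only the first standalone number found on standard input.
# Assumes a grep that supports -o and \b (e.g. GNU grep).
first_standalone_grep() {
    grep -Eo '\b[0-9]+\b' | head -n 1
}

printf '%s\n' 'hello12 54 world23 43' | first_standalone_grep   # prints 54
```

head -n 1 is the POSIX spelling of head -1.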

From GNU grep → 3.3 The Backslash Character and Special Expressions:

‘\b’
Match the empty string at the edge of a word.

~~If you really need sed:~~

$ sed -r 's/.*?\b([0-9]+)\b.*/\1/' <<< "hello12 54 world23 43"
54
$ sed -r 's/.*?\b([0-9]+)\b.*/\1/' <<< "54 world23 43"
54

This catches the first block of [0-9]+ that occurs in a given line that constitutes a word itself. Then, it prints it back.

Removed, since sed does not support non-greedy matching with .*?, so the commands above do not reliably return the first number.

edited May 23 '17 at 12:00

Community

answered Sep 27 '16 at 11:09

fedorqui

Apologies, but I cannot use `grep`. My environment does not have `grep` functionality – Haris Sep 27 '16 at 11:10
@Haris uhms, what environment does have sed but not grep? – fedorqui Sep 27 '16 at 11:11
@Haris anyway, see my updated answer with a `sed` approach. – fedorqui Sep 27 '16 at 11:12
Your RE is giving me this error `RE error: repetition-operator operand invalid` – Haris Sep 27 '16 at 11:15
A private enterprise one :p – Haris Sep 27 '16 at 11:18
@Haris you should update your question indicating what strange environment you are working on. `sed --version` and the shell you are using would help. – fedorqui Sep 27 '16 at 11:18
@fedorqui Try `sed -r 's/.*?\b([0-9]+)\b.*/\1/' <<< "hello12 54 world23 43 "` – hek2mgl Sep 27 '16 at 11:20
@Haris *what* version are you using? Note that SO is *not* funny! – hek2mgl Sep 27 '16 at 11:23
@hek2mgl good one! I did not know that [`sed` does not recognize the `.*?` non greedy regex matching](http://stackoverflow.com/a/1103177/1983854) :/ – fedorqui Sep 27 '16 at 11:25
@hek2mgl, well, it does not have support for some options, and `--version` is one of them. And sorry, I said the right thing, with a sarcastic tinge. – Haris Sep 27 '16 at 11:25
@Haris Which system does not ship with `grep` but with an ancient version of `sed` ? You'll need to provide more information, otherwise nobody can help. Btw, sarcasm was true for my previous comment as well. – hek2mgl Sep 27 '16 at 11:31
@hek2mgl, I would If I had. I cannot determine the version, and giving the name of the system won't help for sure.. – Haris Sep 27 '16 at 11:33

score 0 · Accepted Answer · edited Jun 20 '20 at 09:12

0

Update #2

Your regex has redundant parts that you can remove. For example, s|^([^.]+).*$|\1| replaces a line that contains no dot with itself. If you are sure there is only one such number in your string, the regex below is enough. Because .* is greedy, it returns the last such number when there are several, so otherwise use one of the other solutions below to capture the first one:

sed -r "s/^.* ([0-9]+) .*/\1/"

Simulating a lazy match (the preferred way):

POSIX ERE (using the -r option)

This works like the greedy version, but it is required if your string may contain more than one such number.

Regex:

 ([0-9]+) .*|.

Usage:

$ sed -r "s/ ([0-9]+) .*|./\1/g" <<< " 54 foo 43 "

POSIX BRE

If you want to go with the oldest regex flavor still in use (POSIX BRE), this is your choice. It works the same as the regex above, but is written in BRE.

Regex:

\(\( \([0-9]*\) .*\)*.\)*

Usage:

$ sed "s/\(\( \([0-9]*\) .*\)*.\)*/\3/g" <<< " 54 foo 43 "

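In the lazy versions, the global g modifier must be set; in the ERE version it is what lets the . branch delete every character that precedes the number.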
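The difference between the greedy and the lazy ERE versions can be sketched with three small wrappers. The function names are made up for illustration, and -E is used in place of -r on the assumption that the sed at hand accepts it (GNU sed treats the two as synonyms). The third function uses the variant from the comments below, which also accepts a number at the very start or end of the line.

```shell
# Greedy: the leading .* runs as far as it can, so the LAST number wins.
last_standalone() {
    sed -E 's/^.* ([0-9]+) .*/\1/'
}

# Simulated lazy match: the "." branch deletes one character at a time
# (group 1 is empty there) until " NUMBER " matches and swallows the rest.
first_standalone() {
    sed -E 's/ ([0-9]+) .*|./\1/g'
}

# Same idea, but the number may also sit at the start or end of the line.
first_standalone_edges() {
    sed -E 's/(^| )([0-9]+)( .*|$)|./\2/g'
}

printf '%s\n' 'hello12 54 world23 43 ' | last_standalone         # prints 43
printf '%s\n' 'hello12 54 world23 43 ' | first_standalone        # prints 54
printf '%s\n' '54 foo 43'              | first_standalone_edges  # prints 54
```

If the line contains no standalone number, the lazy versions delete every character and print an empty line.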

edited Jun 20 '20 at 09:12

Community

answered Sep 27 '16 at 11:16

revo

The input string `"54 foo 43"` breaks both solutions. – hek2mgl Sep 27 '16 at 11:46
Re-check OP *(having space before and after)* @hek2mgl – revo Sep 27 '16 at 11:47
@revo `" 54 foo 43 "` breaks the first solution as well. – hek2mgl Sep 27 '16 at 11:51
No it doesn't. @hek2mgl – revo Sep 27 '16 at 11:52
I think you don't use `-r` option with my regex version. @hek2mgl – revo Sep 27 '16 at 11:52
I'm using `sed -r "s/^.* ([0-9]+) .*/\1/" <<< " 54 foo 43 "`. The output is `43`. Btw, the `+` quantifier is *not* POSIX. – hek2mgl Sep 27 '16 at 11:57
You should use second approach of mine `sed -r "s/ ([0-9]+) .*|./\1/g" <<< " 54 foo 43 "`. `+` doesn't need to be escaped when `-r` option is used. @hek2mgl – revo Sep 27 '16 at 12:11
I'm sure you didn't read my answer entirely. That's why you couldn't work out which approach should be used when. @hek2mgl – revo Sep 27 '16 at 12:12
Looks like I was a bit confused by the first version(s), likely because I did not read it carefully enough. The `([0-9]+) .*|.` trick is nice, but a bit limited since it would not work with an input string like `"54 foo 43"`. Still +1 since it answers *this* question. – hek2mgl Sep 27 '16 at 17:17
Again I'd say it is OP's definition of a standalone number: *(having space before and after)*, not mine. But also if you like to care about numbers at the end or beginning of a string there is no problem: `sed -r "s/(^| )([0-9]+)( .*|$)|./\2/g" <<< "54 foo 43"` @hek2mgl – revo Sep 27 '16 at 17:31
I'd ask the down-voter to explain what's wrong with this answer so that I can improve it. – revo Sep 27 '16 at 23:06
