Matching arbitrary number of digits using grep regex

Question

I've got a file whose lines look like the following:

data
datalater
983290842
Data387428later
datafhj893724897290384later
4329804928later

What I am looking to do is use regex to match any line that starts with data and ends with later AND has numbers in between. Here is what I've concocted so far:

^[D,d]ata[0-9]*later$

However the output includes all datalater lines. I suppose I could pipe the output and grep -v datalater, but I feel like a single expression should do the trick.

Tom · Accepted Answer · 2013-02-17T21:59:09.047

11

Use + instead of *.

+ matches one or more of the preceding element.
* matches zero or more.

^[Dd]ata[0-9]+later$
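
A quick way to see the difference is to run both quantifiers against the sample lines from the question. This is only a sketch: the variable names are made up for illustration, and -E is used here so that + can be written unescaped.

```shell
# Sample lines from the question, held in a variable (names are illustrative).
sample=$(printf '%s\n' data datalater 983290842 Data387428later \
    datafhj893724897290384later 4329804928later)

# '*' allows zero digits, so "datalater" is matched as well.
star=$(printf '%s\n' "$sample" | grep -E '^[Dd]ata[0-9]*later$')

# '+' requires at least one digit, so "datalater" is dropped.
plus=$(printf '%s\n' "$sample" | grep -E '^[Dd]ata[0-9]+later$')

printf 'with *:\n%s\nwith +:\n%s\n' "$star" "$plus"
```

With * the result contains both datalater and Data387428later; with + only Data387428later remains.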

In grep's default (basic) mode you need to escape the +. We can also use \d, a Perl-style shorthand that matches a single digit; not every grep supports it, and [0-9] is the portable equivalent.

^[Dd]ata\d\+later$

In your example file you also have a line:

datafhj893724897290384later

This line currently will not be matched, because there are letters between data and the digits. We can fix this by adding [^0-9]* to match any non-digits after data, up to the digits.

Our final command will be:

grep '^[Dd]ata[^0-9]*\d\+later$' filename
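
If \d or \+ is not understood by a particular grep, the same idea can be written with POSIX basic-regex syntax only. The following is a sketch under that assumption, not the answer's original command: [0-9][0-9]* stands in for "one or more digits", and the variable names are illustrative.

```shell
# Sample lines from the question (variable names are illustrative).
sample=$(printf '%s\n' data datalater 983290842 Data387428later \
    datafhj893724897290384later 4329804928later)

# [^0-9]* skips any non-digits after "data";
# [0-9][0-9]* means one digit followed by zero or more digits.
matches=$(printf '%s\n' "$sample" | grep '^[Dd]ata[^0-9]*[0-9][0-9]*later$')

printf '%s\n' "$matches"
```

This prints Data387428later and datafhj893724897290384later, and skips datalater because at least one digit is required.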

edited Feb 17 '13 at 21:59

answered Feb 17 '13 at 21:37

Tom


When using this expression, or @Eric, I get no results on output. Here's what I am using: grep ^[D,d]ata[0-9]+later$ filename – hdub Feb 17 '13 at 21:49
Still no dice with this even as a copy/paste. – hdub Feb 17 '13 at 22:17
The file contents do have whitespace/line breaks as well `$ cat test2 datadata datalater data98349248later datadhsd90834092823later` – hdub Feb 17 '13 at 22:24
If there are whitespaces could you update your example file in the question so I can update the regex. It currently works for the examples you have provided. – Tom Feb 17 '13 at 23:32
8 years later, but the tidbit about the + needing be escaped is gold (and quite unintuitive to find by trial and error in a "I need to do this but Linux is not my native environment" situation). – Jostikas Nov 24 '21 at 06:16
`\d` is a Perl extension which is generally not portable. Some `grep`s support it, but the POSIX-portable solution is `[[:digit:]]` or simply `[0-9]` if you don't care about locale variations etc. – tripleee May 31 '22 at 18:41

score 3 · Answer 2 · answered Aug 20 '16 at 01:14

Using Cygwin, the above commands didn't work. I had to modify the commands given above to get the desired results.

$ cat > file.txt <<EOL
> data
> datalater
> 983290842
> Data387428later
> datafhj893724897290384later
> 4329804928later
> EOL

I always like to make sure my file has what I expect it to have:

$ cat file.txt
data
datalater
983290842
Data387428later
datafhj893724897290384later
4329804928later

$

I needed to run Perl-style expressions with the -P flag. This meant I couldn't use the [^0-9]*, whose necessity @Tom_Cammann aptly pointed out. Instead, I used .*, which matches any sequence of characters and backtracks as needed so that the rest of the pattern can still match. Here are my command and output.

$ grep -P '^[Dd]ata.*\d+later$' file.txt
Data387428later
datafhj893724897290384later

$

I wish I could give a better explanation of WHY Perl expressions are needed, but I just know that Cygwin's grep works a bit differently.

System Info

$ uname -a
CYGWIN_NT-10.0 A-1052207 2.5.2(0.297/5/3) 2016-06-23 14:29 x86_64 Cygwin

My Results from the previous answers

$ grep '^[Dd]ata[^0-9]*\d\+later$' file2.txt

$ grep '^[Dd]ata\d+later$' file2.txt

$ grep -P '^[Dd]ata[^0-9]*\d\+later$' file2.txt

$ grep -P '^[Dd]ata\d+later$' file2.txt
Data387428later

$
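
One possible reading of these results, consistent with the comments on this answer: under -P the sequence \+ is a literal plus sign rather than a quantifier, so \d\+ asks for a digit followed by a "+" character. The sketch below assumes a grep built with -P support and skips the check otherwise; the input lines and variable names are made up for illustration.

```shell
pcre_literal=skipped
pcre_quantifier=skipped

# Only run the demonstration if this grep supports Perl-style patterns.
if printf 'x1\n' | grep -P '\d' >/dev/null 2>&1; then
    # With -P, \+ matches a literal plus sign after a single digit.
    pcre_literal=$(printf '%s\n' 'Data387428later' 'Data1+later' |
        grep -P '^[Dd]ata\d\+later$')
    # With -P, an unescaped + is the "one or more" quantifier.
    pcre_quantifier=$(printf '%s\n' 'Data387428later' 'Data1+later' |
        grep -P '^[Dd]ata\d+later$')
fi

printf 'escaped: %s\nunescaped: %s\n' "$pcre_literal" "$pcre_quantifier"
```

Where -P is available, the escaped form selects only Data1+later and the unescaped form selects only Data387428later.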

answered Aug 20 '16 at 01:14

bballdave025


your answer helped me as well though I was using MinGW. according to https://stackoverflow.com/questions/771756/what-is-the-difference-between-cygwin-and-mingw the git-bash seems "It depends on the MSYS DLL, which is a fork of the Cygwin DLL" which would explain everything – aldr Sep 27 '19 at 11:16
This is confused: the backslashed `\+` doesn't make sense with `-P` and `\d` doesn't make sense without it. – tripleee May 31 '22 at 18:42
@tripleee , I absolutely agree that it's confused. Going from an archived copy of this Q&A - available from the WaybackMachine link in my next comment - I see the following (user, final-code) pairs: { (@Tom_Cammann, `grep '^[Dd]ata[^0-9]*\d\+later$' filename`), (@Eric_Galluzzo, `^[Dd]ata\d+later$`) }. I simply tried them all as they appeared in the answers, each with and without the `-P` flag. I imagine the differences might be due to different versions (e.g. a Cygwin version of `grep`), or to `egrep`, `fgrep`, ..., or even something like `alias grep='grep -P'`. If you know, please elucidate. – bballdave025 Jun 14 '22 at 00:40
https://web.archive.org/web/20220614001142/https://stackoverflow.com/questions/14926332/matching-arbitrary-number-of-digits-using-grep-regex – bballdave025 Jun 14 '22 at 00:42
Now, with 6 years more experience, I would do something like `grep '^[Dd]ata[^0-9]*[0-9]\+later$'` or `grep -P '^[Dd]ata[^\d]*\d+later$' dfile`, but I still don't know why neither of the other two answers worked on Cygwin. – bballdave025 Jun 14 '22 at 00:46

score 2 · Answer 3 · answered Feb 17 '13 at 21:41

You're matching zero or more digits with the * quantifier. Try

^[Dd]ata\d+later$

instead. You were also finding commas at the beginning of the string (e.g. ",ata1234later"). And \d is a shortcut to finding any digit character. So I changed those as well.
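
The comma point can be checked directly. In this sketch the two input lines are made up for illustration, [0-9] is used in place of \d, and -E is used so that + needs no escaping.

```shell
# Inside a bracket expression the comma is just another character to match,
# so [D,d] accepts "D", "," or "d".
with_comma=$(printf '%s\n' ',ata1234later' 'data1234later' |
    grep -E '^[D,d]ata[0-9]+later$')

# Without the comma, only "D" or "d" is accepted.
without_comma=$(printf '%s\n' ',ata1234later' 'data1234later' |
    grep -E '^[Dd]ata[0-9]+later$')

printf '[D,d]:\n%s\n[Dd]:\n%s\n' "$with_comma" "$without_comma"
```

The [D,d] version selects both lines, while [Dd] selects only data1234later.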

answered Feb 17 '13 at 21:41

Eric Galluzzo


I wish this showed output, but doesn't yield any results. See my response to @Tom – hdub Feb 17 '13 at 21:51

score 2 · Answer 4 · answered Mar 10 '20 at 18:47

An unescaped "+" is only a quantifier in extended regexps; in standard (basic) grep it matches a literal plus sign (GNU grep also accepts the escaped form \+).
At least, that's my experience on RHEL.

To use extended regexps, run egrep or pass "-E" / "--extended-regexp". Examples:

Standard grep

echo abc123n1  | grep "abc[0-9]+n1"
<no output>

egrep

echo abc123n1  | egrep "abc[0-9]+n1"
abc123n1

grep with -E

echo abc123n1  | grep -E "abc[0-9]+n1"
abc123n1
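
To illustrate why standard grep printed nothing above, here is a small sketch. The second input line, abc1+n1, is made up for the demonstration, and the POSIX interval \{1,\} is shown as one way to say "one or more" without -E.

```shell
# In basic mode an unescaped + is a literal plus sign,
# so only the line that really contains "+" is selected.
literal=$(printf '%s\n' 'abc123n1' 'abc1+n1' | grep 'abc[0-9]+n1')

# The POSIX interval \{1,\} means "one or more" in basic mode.
interval=$(printf '%s\n' 'abc123n1' 'abc1+n1' | grep 'abc[0-9]\{1,\}n1')

printf 'literal +: %s\ninterval: %s\n' "$literal" "$interval"
```

The first command selects abc1+n1 and the second selects abc123n1.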

HTH

score 1 · Answer 5 · answered Feb 17 '13 at 21:39

You should put a "+" (which means one or several) instead of "*" (which means zero, one or several).

answered Feb 17 '13 at 21:39

Fafhrd


oOps, Tom answered while I was writing an answer among several things, he got it ! – Fafhrd Feb 17 '13 at 21:41

score -1 · Answer 6 · answered Oct 29 '21 at 03:24

MOTIVATION

The rest of the answers don't work on all systems.

REQUISITES

grep
The option: --extended-regexp
Character groups, aka: [:group:]
Matching one or more of the preceding, aka: +
Optionally setting as starting or ending: ^whatever$

COMMAND

grep --extended-regexp "[[:group:]]+"

GROUPS

alnum
alpha
blank
cntrl
digit
graph
lower
print
punct
space
upper
xdigit
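
Applied to the question's data, the template above might look like the following sketch. It uses -E, the short form of --extended-regexp, with digit substituted for the group placeholder; the second pattern, which also allows letters via alpha, is an illustrative extension, and the variable names are made up.

```shell
# Sample lines from the question (variable names are illustrative).
sample=$(printf '%s\n' data datalater 983290842 Data387428later \
    datafhj893724897290384later 4329804928later)

# [[:digit:]]+ requires one or more digits between "data" and "later".
digits_only=$(printf '%s\n' "$sample" |
    grep -E '^[Dd]ata[[:digit:]]+later$')

# [[:alpha:]]* additionally allows letters before the digits.
letters_then_digits=$(printf '%s\n' "$sample" |
    grep -E '^[Dd]ata[[:alpha:]]*[[:digit:]]+later$')

printf '%s\n---\n%s\n' "$digits_only" "$letters_then_digits"
```

The first pattern selects Data387428later only; the second also selects datafhj893724897290384later.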

score -1 · Answer 7 · edited Aug 29 '22 at 22:01

grep -Eio "^(data)[0-9]+(later)$"

^[dD]ata=^d later$=r$

edited Aug 29 '22 at 22:01

RusArtM


answered Aug 24 '22 at 00:14

user19833197

