8

I've got a file with lines that look like the following:

data
datalater
983290842
Data387428later
datafhj893724897290384later
4329804928later

What I am looking to do is use regex to match any line that starts with data and ends with later AND has numbers in between. Here is what I've concocted so far:

^[D,d]ata[0-9]*later$ 

However, the output also includes the datalater lines. I suppose I could pipe the output through grep -v datalater, but I feel like a single expression should do the trick.

jww
hdub

7 Answers

11

Use + instead of *.

+ matches one or more of the preceding element.
* matches zero or more.

^[Dd]ata[0-9]+later$
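To see the difference on the sample lines from the question, here is a small sketch. It assumes a grep that supports -E (extended regular expressions), where + needs no backslash; the variable names are made up for the illustration.

```shell
# The sample lines from the question, held in a variable (hypothetical name).
sample='data
datalater
983290842
Data387428later
datafhj893724897290384later
4329804928later'

# '*' allows zero digits, so "datalater" is matched too.
star_matches=$(printf '%s\n' "$sample" | grep -E '^[Dd]ata[0-9]*later$')

# '+' requires at least one digit, so "datalater" drops out.
plus_matches=$(printf '%s\n' "$sample" | grep -E '^[Dd]ata[0-9]+later$')

printf 'with *:\n%s\n' "$star_matches"
printf 'with +:\n%s\n' "$plus_matches"
```

With * both datalater and Data387428later are printed; with + only Data387428later remains.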

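With plain grep (basic regular expressions) you need to escape the + as \+. We can also write \d for a single digit, though that is a Perl-style shorthand that not every grep supports; [0-9] is the portable form.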

^[Dd]ata\d\+later$

In your example file you also have a line:

datafhj893724897290384later

This line is currently not matched because there are letters between data and the digits. We can fix this by adding [^0-9]* to match any non-digit characters between data and the digits.

Our final command will be:

grep '^[Dd]ata[^0-9]*\d\+later$' filename
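If your grep does not accept \d or \+ (neither is part of POSIX basic regular expressions), a sketch of the same idea using only portable syntax might look like this. The sample variable is an assumption for illustration; "one or more digits" is spelled as [0-9][0-9]*.

```shell
# The sample lines from the question (hypothetical variable name).
sample='data
datalater
983290842
Data387428later
datafhj893724897290384later
4329804928later'

# [^0-9]* skips any non-digits after "data";
# [0-9][0-9]* then demands at least one digit right before "later".
portable_matches=$(printf '%s\n' "$sample" |
    grep '^[Dd]ata[^0-9]*[0-9][0-9]*later$')

printf '%s\n' "$portable_matches"
```

On the sample lines this should print Data387428later and datafhj893724897290384later, and nothing else.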
Tom
  • When using this expression, or @Eric, I get no results on output. Here's what I am using: grep ^[D,d]ata[0-9]+later$ filename – hdub Feb 17 '13 at 21:49
  • Still no dice with this even as a copy/paste. – hdub Feb 17 '13 at 22:17
  • The file contents do have whitespace/line breaks as well `$ cat test2 datadata datalater data98349248later datadhsd90834092823later` – hdub Feb 17 '13 at 22:24
  • If there are whitespaces could you update your example file in the question so I can update the regex. It currently works for the examples you have provided. – Tom Feb 17 '13 at 23:32
  • 8 years later, but the tidbit about the + needing to be escaped is gold (and quite unintuitive to find by trial and error in a "I need to do this but Linux is not my native environment" situation). – Jostikas Nov 24 '21 at 06:16
  • `\d` is a Perl extension which is generally not portable. Some `grep`s support it, but the POSIX-portable solution is `[[:digit:]]` or simply `[0-9]` if you don't care about locale variations etc. – tripleee May 31 '22 at 18:41
3

On Cygwin, the commands above didn't work; I had to modify them to get the desired results.

$ cat > file.txt <<EOL
> data
> datalater
> 983290842
> Data387428later
> datafhj893724897290384later
> 4329804928later
> EOL

I always like to make sure my file has what I expect it to have:

$ cat file.txt
data
datalater
983290842
Data387428later
datafhj893724897290384later
4329804928later

$

I needed to run Perl-style expressions with the -P flag. I did not use the [^0-9]* whose necessity @Tom_Cammann aptly pointed out; instead, I used .*, which matches any sequence of characters and backtracks as needed so that the rest of the pattern can still match. Here are my command and output.

$ grep -P '^[Dd]ata.*\d+later$' file.txt
Data387428later
datafhj893724897290384later

$
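One caveat with .* is that it is looser than [^0-9]*: it also lets digits appear earlier in the line. The sketch below illustrates this with made-up input lines that are not in the question's file, and uses grep -E with [0-9] instead of -P with \d so that it does not depend on Perl support.

```shell
# Hypothetical extra test lines, not taken from the question's file.
lines='datafhj893later
data12abc34later
data12abclater'

# '.*' accepts anything (digits included) before the final run of digits.
loose=$(printf '%s\n' "$lines" | grep -E '^[Dd]ata.*[0-9]+later$')

# '[^0-9]*' allows only non-digits between "data" and the digits.
strict=$(printf '%s\n' "$lines" | grep -E '^[Dd]ata[^0-9]*[0-9]+later$')

printf 'loose:\n%s\n' "$loose"
printf 'strict:\n%s\n' "$strict"
```

The loose pattern accepts data12abc34later as well, while the strict one keeps only datafhj893later. Which behaviour is wanted depends on how "numbers in between" is meant in the question.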

I wish I could give a better explanation of WHY Perl expressions are needed, but I just know that Cygwin's grep works a bit differently.

System Info

$ uname -a
CYGWIN_NT-10.0 A-1052207 2.5.2(0.297/5/3) 2016-06-23 14:29 x86_64 Cygwin

My Results from the previous answers

$ grep '^[Dd]ata[^0-9]*\d\+later$' file2.txt

$ grep '^[Dd]ata\d+later$' file2.txt

$ grep -P '^[Dd]ata[^0-9]*\d\+later$' file2.txt

$ grep -P '^[Dd]ata\d+later$' file2.txt
Data387428later

$
bballdave025
  • 1
    your answer helped me as well though I was using MinGW. according to https://stackoverflow.com/questions/771756/what-is-the-difference-between-cygwin-and-mingw the git-bash seems "It depends on the MSYS DLL, which is a fork of the Cygwin DLL" which would explain everything – aldr Sep 27 '19 at 11:16
  • This is confused: the backslashed `\+` doesn't make sense with `-P` and `\d` doesn't make sense without it. – tripleee May 31 '22 at 18:42
  • @tripleee , I absolutely agree that it's confused. Going from an archived copy of this Q&A - available from the WaybackMachine link in my next comment - I see the following (user, final-code) pairs: { (@Tom_Cammann, `grep '^[Dd]ata[^0-9]*\d\+later$' filename`), (@Eric_Galluzzo, `^[Dd]ata\d+later$`) }. I simply tried them all as they appeared in the answers, each with and without the `-P` flag. I imagine the differences might be due to different versions (e.g. a Cygwin version of `grep`), or to `egrep`, `fgrep`, ..., or even something like `alias grep='grep -P'`. If you know, please elucidate. – bballdave025 Jun 14 '22 at 00:40
  • https://web.archive.org/web/20220614001142/https://stackoverflow.com/questions/14926332/matching-arbitrary-number-of-digits-using-grep-regex – bballdave025 Jun 14 '22 at 00:42
  • Now, with 6 years more experience, I would do something like `grep '^[Dd]ata[^0-9]*[0-9]\+later$'` or `grep -P '^[Dd]ata[^\d]*\d+later$' dfile`, but I still don't know why neither of the other two answers worked on Cygwin. – bballdave025 Jun 14 '22 at 00:46
2

You're matching zero or more digits with the * quantifier. Try

^[Dd]ata\d+later$

instead. Your character class [D,d] also matched a comma at the beginning of the line (e.g. ",ata1234later"), and \d is a shorthand for any digit character, so I changed those as well.
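The comma issue can be checked with a quick sketch. It assumes grep -E and uses [0-9] rather than \d so that it runs without Perl support; the input lines and variable names are invented for the demonstration.

```shell
# Three made-up lines: one starts with a comma instead of D/d.
input=',ata1234later
data1234later
Data1234later'

# Inside brackets a comma is just another character, so [D,d] accepts it.
with_comma=$(printf '%s\n' "$input" | grep -Ec '^[D,d]ata[0-9]+later$' || true)

# [Dd] accepts only the two letters.
without_comma=$(printf '%s\n' "$input" | grep -Ec '^[Dd]ata[0-9]+later$' || true)

echo "lines matched by [D,d]: $with_comma"
echo "lines matched by [Dd]:  $without_comma"
```

The -c option prints the number of matching lines, so the first count includes the ",ata1234later" line and the second does not.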

Eric Galluzzo
2

The unescaped "+" syntax only works with extended regular expressions, not with standard grep.
At least, that's my experience on RHEL.

To use extended regular expressions, run egrep or pass "-E" / "--extended-regexp". Examples:

Standard grep

echo abc123n1  | grep "abc[0-9]+n1"
<no output>

egrep

echo abc123n1  | egrep "abc[0-9]+n1"
abc123n1

grep with -E

echo abc123n1  | grep -E "abc[0-9]+n1"
abc123n1
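If you have to stay with standard grep, "one or more" can still be written in basic regular expressions. The following sketch reuses the same test string and shows two spellings that should work without -E: the interval \{1,\} and the doubled form [0-9][0-9]*. The variable names are only for illustration.

```shell
# Interval expression: \{1,\} means "at least one" in basic regular expressions.
interval_out=$(echo abc123n1 | grep 'abc[0-9]\{1,\}n1')

# Doubled form: one digit, followed by zero or more digits.
doubled_out=$(echo abc123n1 | grep 'abc[0-9][0-9]*n1')

echo "$interval_out"
echo "$doubled_out"
```

Both commands print abc123n1.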

HTH

Shewbs
1

You should put a "+" (which means one or several) instead of "*" (which means zero, one or several).

Fafhrd
-1

MOTIVATION

The other answers don't work on all systems.


REQUISITES

  • grep
  • The option: --extended-regexp
  • Character groups, aka: [:group:]
  • Matching one or more of the preceding, aka: +
  • Optionally anchoring the match to the start and end of the line: ^whatever$

COMMAND

grep --extended-regexp "[[:group:]]+"


GROUPS

  • alnum
  • alpha
  • blank
  • cntrl
  • digit
  • graph
  • lower
  • print
  • punct
  • space
  • upper
  • xdigit
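Applied to the lines from the question, the template above could be filled in roughly as follows. This is a sketch, assuming the digit group and the short option -E (equivalent to --extended-regexp); the sample variable is a made-up name.

```shell
# The sample lines from the question (hypothetical variable name).
sample='data
datalater
983290842
Data387428later
datafhj893724897290384later
4329804928later'

# [^[:digit:]]* skips non-digits after "data";
# [[:digit:]]+ requires one or more digits right before "later".
class_matches=$(printf '%s\n' "$sample" |
    grep -E '^[Dd]ata[^[:digit:]]*[[:digit:]]+later$')

printf '%s\n' "$class_matches"
```

This should print Data387428later and datafhj893724897290384later.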
-1
grep -Eio "^(data)[0-9]+(later)$"

^[dD]ata=^d later$=r$
RusArtM