6

I am fairly inexperienced with the behavior of grep. I have a bunch of XML files that contain lines like these:

<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>
<identifier type="abc">abc:def.ghi/g5678m.ab678901</identifier>

I wanted to get the identifier part after the slash and constructed a regex using RegexPal:

[a-z]\d{4}[a-z]*\.[a-z]*\d*

It highlights everything that I wanted. Perfect. Now when I run grep on the very same file, I don't get any results. And as I said, I really don't know much about grep, so I tried all different combinations.

grep [a-z]\d{4}[a-z]*\.[a-z]*\d* test.xml
grep "[a-z]\d{4}[a-z]*\.[a-z]*\d*" test.xml
egrep "[a-z]\d{4}[a-z]*\.[a-z]*\d*" test.xml
grep '[a-z]\d{4}[a-z]*\.[a-z]*\d*' test.xml
grep -E '[a-z]\d{4}[a-z]*\.[a-z]*\d*' test.xml

What am I doing wrong?

slhck

6 Answers

13

Your regex doesn't match the input. Let's break it down:

  • [a-z] matches g
  • \d{4} matches 1234
  • [a-z]* doesn't match .

Also, grep and family don't support the \d syntax in their basic and extended regex modes. Use either [0-9] or [[:digit:]] instead.

Finally, when using regular expressions, prefer egrep (equivalent to grep -E) to plain grep: it treats ?, +, {, |, ( and ) as operators without requiring backslashes. Also, in many shells (including bash on OS X, as you mentioned), put the pattern in single quotes. An unquoted * can be expanded by the shell to a list of files in the current directory before grep sees it, and inside double quotes the shell still processes $, backticks, and some backslash sequences. Bash won't touch anything in single quotes.
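
As a minimal sketch of both points, assuming a throwaway file named sample.xml (a hypothetical name) that holds the two lines from the question, the pattern below swaps \d for [0-9] and is single-quoted so the shell passes it to grep unchanged:

```shell
# sample.xml is a hypothetical scratch file holding the question's two lines.
cat > sample.xml <<'EOF'
<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>
<identifier type="abc">abc:def.ghi/g5678m.ab678901</identifier>
EOF

# Single quotes keep the shell from touching *, \ or {}.
# -E enables extended regex, so {4} works as an interval without backslashes.
# -o (supported by GNU and BSD grep) prints only the matched part of each line.
matches=$(grep -E -o '[a-z][0-9]{4}[a-z]*\.[a-z]*[0-9]*' sample.xml)
printf '%s\n' "$matches"
```

On the two sample lines this should print g1234.ab012345 and g5678m.ab678901, one per line.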

Meekohi
Jon
  • Had a typo there, sorry. It's corrected now. And the regex still matches in the online tool. – slhck Nov 16 '10 at 09:46
  • Thanks a lot! `egrep` rocks. – naXa stands with Ukraine Jan 15 '16 at 12:20
  • 1
    I'm confused. `[a-z]*` doesn't match `.` but it is optional so it matches 0 characters and then the next bit of the regex does match the dot. And that's why it works on the regex tester site. I think the actual problem is using extended regex like you suggested. – Jerry Jeremiah Dec 15 '20 at 21:37
6

grep doesn't support \d by default. To match a digit, use [0-9], or enable Perl-compatible regular expressions with -P (a GNU grep option):

$ grep -P "[a-z]\d{4}[a-z]*\.[a-z]*\d*" test.xml

or:

$ egrep "[a-z][0-9]{4}[a-z]*\.[a-z]*[0-9]*" test.xml
Kobi
3

By default, grep uses "basic" regular expressions (excerpt from the man page):

Basic vs Extended Regular Expressions
   In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their
   special meaning; instead use the backslashed versions \?, \+, \{,  \|,  \(,  and
   \).

   Traditional  egrep  did  not  support  the  {  meta-character,  and  some  egrep
   implementations support \{ instead,  so  portable  scripts  should  avoid  {  in
   grep -E patterns and should use [{] to match a literal {.

   GNU  grep -E  attempts  to  support  traditional usage by assuming that { is not
   special if it would be the start of  an  invalid  interval  specification.   For
   example,  the  command  grep -E '{1'  searches  for  the two-character string {1
   instead of reporting a syntax error in the regular expression.   POSIX.2  allows
   this behavior as an extension, but portable scripts should avoid it.

Also, depending on which shell you are running, the '*' character might get expanded if the pattern is not quoted.
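
To illustrate the excerpt, here is a sketch of the same search written for plain grep (basic regex), where the interval braces have to be backslashed. The file name bre_sample.xml is a hypothetical scratch file, and the pattern uses [0-9] in place of \d:

```shell
# bre_sample.xml is a hypothetical scratch file with one line from the question.
cat > bre_sample.xml <<'EOF'
<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>
EOF

# Basic regex: { and } are literal, so the interval is written \{4\}.
bre_count=$(grep -c '[a-z][0-9]\{4\}[a-z]*\.[a-z]*[0-9]*' bre_sample.xml)

# Same pattern with unescaped braces in basic mode: {4} is matched literally,
# so no line matches. grep -c still prints 0 but exits non-zero, hence || true.
literal_count=$(grep -c '[a-z][0-9]{4}[a-z]*\.[a-z]*[0-9]*' bre_sample.xml || true)

printf 'escaped braces: %s, literal braces: %s\n' "$bre_count" "$literal_count"
```

The first count should be 1 and the second 0, which is the behavior the question ran into when using {4} without -E.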

James Anderson
  • I'm using bash 3.2 on OS X. The -E switch doesn't help, either (added it in my original question) – slhck Nov 16 '10 at 09:50
2

You can make use of the following command:

$ cat file
<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>

# Use -P option to enable Perl style regex \d.
$ grep -P  '[a-z]\d{4}[a-z]*\.[a-z]*\d*' file
<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>

# to get only the part of the input that matches use -o option:
$ grep -P -o '[a-z]\d{4}[a-z]*\.[a-z]*\d*' file
g1234.ab012345

# You can use [0-9] in place of \d and use the -E option.
$ grep -E -o '[a-z][0-9]{4}[a-z]*\.[a-z]*[0-9]*' file
g1234.ab012345
$ 
codaddict
  • Sorry, i had a typo there. The test file is right, and the regex matches in the online tool. – slhck Nov 16 '10 at 09:46
0

Try this:

[a-z]\d{5}[.][a-z]{2}\d{6}

Valentin H
0

Try this expression in grep:

[a-z]\d{4}[a-z]*\.[a-z]*\d*
Paweł Nadolski