I am trying to match both [ and ] with grep, but have only succeeded in matching [. No matter what I try, I can't get it to match ].

Here's a code sample:

echo "fdsl[]" | grep -o "[ a-z]\+" #this prints fdsl
echo "fdsl[]" | grep -o "[ \[a-z]\+" #this prints fdsl[
echo "fdsl[]" | grep -o "[ \]a-z]\+" #this prints nothing
echo "fdsl[]" | grep -o "[ \[\]a-z]\+" #this prints nothing

Edit: My original regex, on which I need to do this, is this one:

echo "fdsl[]" | grep -o "[ \[\]\t\na-zA-Z\/:\.0-9_~\"'+,;*\=()$\!@#&?-]\+" 
#this prints nothing

N.B.: I have tried all the answers from this post, but none of them worked in this particular case. I also need to use those brackets inside [].

codeforester
Jahid

3 Answers

According to the BRE/ERE Bracket Expression section of the POSIX regex specification:

  1. [...] The right-bracket ( ']' ) shall lose its special meaning and represent itself in a bracket expression if it occurs first in the list (after an initial circumflex ( '^' ), if any). Otherwise, it shall terminate the bracket expression, unless it appears in a collating symbol (such as "[.].]" ) or is the ending right-bracket for a collating symbol, equivalence class, or character class. The special characters '.', '*', '[', and '\' (period, asterisk, left-bracket, and backslash, respectively) shall lose their special meaning within a bracket expression.

and

  1. [...] If a bracket expression specifies both '-' and ']', the ']' shall be placed first (after the '^', if any) and the '-' last within the bracket expression.

Therefore, your regex should be:

echo "fdsl[]" | grep -Eo "[][ a-z]+"

Note the E flag, which selects ERE, where the + quantifier is supported. The + quantifier is not supported in BRE (the default mode).
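
The two placement rules quoted above can be checked with a minimal sketch like the one below. It assumes a grep that supports the -o option (available in GNU and BSD grep, but not required by POSIX); the second input string is made up purely for illustration.

```shell
# ']' placed first is literal, and '[' needs no escape inside the brackets.
both=$(printf 'fdsl[]\n' | grep -Eo '[][ a-z]+')
printf '%s\n' "$both"        # prints fdsl[]

# When the set needs both ']' and '-', ']' goes first and '-' goes last.
with_dash=$(printf 'a-b]c d\n' | grep -Eo '[]a-z-]+')
printf '%s\n' "$with_dash"   # prints a-b]c and then d, on separate lines
```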

The solution in Mike Holt's answer, "[][a-z ]\+" with an escaped +, works because it is run on GNU grep, which extends the BRE grammar so that \+ means "repeat once or more". According to the POSIX standard this is actually undefined behavior (which means that an implementation can give it meaningful behavior and document it, or throw a syntax error, or whatever).

If you are fine with assuming that your code will only ever run in a GNU environment, then it's totally fine to use Mike Holt's answer. Take sed as an example: with POSIX sed you are stuck with BRE (no flag to switch over to ERE), and it's cumbersome to write even a simple regular expression in POSIX BRE, where the only defined quantifiers are * and the interval expression \{m,n\}, so "one or more" has to be spelled \{1,\}.
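
For strictly portable BRE, "one or more" can be spelled with the interval expression \{1,\}, which POSIX does define for BRE, or as one occurrence followed by *. The following is only a sketch of both spellings using sed; the <&> replacement is just an illustrative way to make the matched text visible.

```shell
# "One or more" as a POSIX BRE interval expression.
interval=$(printf 'fdsl[]\n' | sed 's/[][ a-z]\{1,\}/<&>/')

# The same thing written as "one occurrence, then zero or more".
star=$(printf 'fdsl[]\n' | sed 's/[][ a-z][][ a-z]*/<&>/')

printf '%s\n%s\n' "$interval" "$star"   # prints <fdsl[]> twice
```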

Original regex

Note that grep consumes the input line by line and checks each line against the regex. Therefore, even if you use the P flag with your original regex, \n is always redundant, as the regex can't match across lines.
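
A small sketch of the line-by-line behavior, using made-up input: the pattern b.c can only match when all three characters sit on the same line, so splitting the text across two lines leaves nothing to match.

```shell
# 'b' ends the first line and 'c' starts the second, so no single line
# contains "b", any character, then "c". grep -c reports 0 matching lines
# (and exits non-zero, hence the "|| true").
split_count=$(printf 'ab\ncd\n' | grep -c 'b.c' || true)

# On one line the same pattern matches.
joined_count=$(printf 'abXcd\n' | grep -c 'b.c' || true)

printf '%s %s\n' "$split_count" "$joined_count"   # prints 0 1
```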

While it is possible to match a horizontal tab without the P flag, I think it is more natural to use the P flag for this task.
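
For reference, one way to match a tab without the P flag is to put a literal tab character into the bracket expression. This is a sketch under the assumption that the shell keeps the tab produced by printf in a variable (the name tab is arbitrary) and that grep supports -o.

```shell
# Store a literal tab; command substitution only strips trailing newlines.
tab=$(printf '\t')

# The tab is interpolated by the shell, so grep sees a real tab in the set.
found=$(printf 'fds\tl[]\n' | grep -Eo "[][ ${tab}a-z]+")
printf '%s\n' "$found"
```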

Given this input:

$ echo -e "fds\tl[]kSAJD<>?,./:\";'{}|[]\\!@#$%^&*()_+-=~\`89"
fds     l[]kSAJD<>?,./:";'{}|[]\!@#$%^&*()_+-=~`89

With the P flag, the original regex from the question works after one small modification (unescape the + at the end):

$ echo -e "fds\tl[]kSAJD<>?,./:\";'{}|[]\\!@#$%^&*()_+-=~\`89" | grep -Po "[ \[\]\t\na-zA-Z\/:\.0-9_~\"'+,;*\=()$\!@#&?-]+"
fds     l[]kSAJD
?,./:";'
[]
!@#$
&*()_+-=~
89

We can also remove \n (since it is redundant, as explained above) and a few other unnecessary escapes:

$ echo -e "fds\tl[]kSAJD<>?,./:\";'{}|[]\\!@#$%^&*()_+-=~\`89" | grep -Po "[ \[\]\ta-zA-Z/:.0-9_~\"'+,;*=()$\!@#&?-]+"
fds     l[]kSAJD
?,./:";'
[]
!@#$
&*()_+-=~
89
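
For comparison, here is a sketch of the same character set written as a plain ERE bracket expression, following the POSIX placement rules (] first, - last, no backslash escapes inside the brackets). The variable names tab and class are illustrative, and the \" and \$ in the assignment are shell escapes only, not part of the regex. In an interactive bash session the ! inside double quotes may trigger history expansion, so this is best run as a script.

```shell
tab=$(printf '\t')

# ']' first, '[' right after it, '-' last; nothing inside needs a backslash.
class="[][ ${tab}a-zA-Z/:.0-9_~\"'+,;*=()\$!@#&?-]"

# Same test input as above, produced with printf instead of echo -e.
matches=$(printf 'fds\tl[]kSAJD<>?,./:";'\''{}|[]\\!@#$%%^&*()_+-=~`89\n' | grep -Eo "${class}+")
printf '%s\n' "$matches"
```

This should produce the same six lines of output as the -P versions above.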
nhahtdh
  • GNU grep equates the usage of `\+` in BRE mode with the usage of `+` in ERE mode, but I agree that it's a better idea to use EREs if they're needed (if for no reason other than portability). – May 05 '15 at 04:55
  • @ChronoKitsune: Thanks for the hint, I found the GNU documentation about this: http://www.gnu.org/software/grep/manual/html_node/Basic-vs-Extended.html#Basic-vs-Extended – nhahtdh May 05 '15 at 04:57
  • Yeah, `\n` is useless here, and the same goes for `sed`. – Jahid May 05 '15 at 05:40
One issue is that [ is a special character in the expression and it cannot be escaped with \ inside a bracket expression (at least not in my flavors of grep). The solution is to put it in its own bracket expression, like [[].
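
A minimal sketch of that idea, assuming a grep with the -o option; the variable names are arbitrary.

```shell
# '[' inside a bracket expression is literal and needs no backslash.
open=$(printf 'fdsl[]\n' | grep -o '[[]')

# ']' is literal when it is the first character of the set.
close=$(printf 'fdsl[]\n' | grep -o '[]]')

printf '%s %s\n' "$open" "$close"   # prints [ ]
```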

Eiko
skotka

According to regular-expressions.info:

In most regex flavors, the only special characters or metacharacters inside a character class are the closing bracket (]), the backslash (\), the caret (^), and the hyphen (-). The usual metacharacters are normal characters inside a character class, and do not need to be escaped by a backslash.

... and ...

The closing bracket (]), the caret (^) and the hyphen (-) can be included by escaping them with a backslash, or by placing them in a position where they do not take on their special meaning.

So, assuming that the particular flavor of regular expressions syntax supported by grep conforms to this, then I would have expected that "[ a-z[\]]\+" should have worked.

However, my version of grep (GNU grep 2.14) only matches the "[]" at the end of "fdsl[]" with this regex.

Then I tried the other technique mentioned in that quote (putting the ] in a position within the character class where it cannot take on its special meaning), and it seems to have worked:

$ echo "fdsl[]" | grep -o "[][a-z ]\+"
fdsl[]
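
A likely explanation for the earlier result, sketched below: in a POSIX bracket expression the backslash is an ordinary character, so "[ a-z[\]]\+" is read as the set "[ a-z[\]" (space, a-z, [ and \) followed by a literal ] repeated by \+. The second input is made up to show that a backslash is a member of that set. As above, \+ and -o are assumed to be available (GNU grep).

```shell
# The set ends at the first ']', so the regex means:
# one character from {space, a-z, '[', '\'} followed by one or more ']'.
tail_only=$(printf 'fdsl[]\n' | grep -o '[ a-z[\]]\+')
printf '%s\n' "$tail_only"        # prints []

# A backslash belongs to the set, so it can precede the run of ']' too.
with_backslash=$(printf 'x\\]]\n' | grep -o '[ a-z[\]]\+')
printf '%s\n' "$with_backslash"   # prints \]]
```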
Mike Holt
  • I'm not sure why `"[][a-z ]\+"` works - it's most likely a non-portable extension, since I can't find anything about it in the specification http://pubs.opengroup.org/onlinepubs/009696899/basedefs/xbd_chap09.html#tag_09_03_05 – nhahtdh May 05 '15 at 04:53
  • @nhahtdh I'm not sure why you're surprised that this works, since the only differences between this and the regex in your answer (which is, btw, a better answer since it provides a more concrete explanation of *why* this works) are minor. You have the space in a different spot, and instead of using the slash in front of the `+`, you use the `-E` option. – Mike Holt May 05 '15 at 05:03
  • @MikeHolt It's a valid concern, since strictly speaking the POSIX spec doesn't define behavior for `\+` (as linked in my comment). Actually, it is **undefined behavior** by the POSIX spec (section 9.3.2). – nhahtdh May 05 '15 at 05:06
  • @nhahtdh, can I use `+` in sed ? currently I am using this regex with `sed` and sticking with `\+` – Jahid May 05 '15 at 05:12
  • I agree it's a valid concern. It just seems like using a backslash to enable ERE syntax on an individual character basis is a well-known and widely supported convention. Perhaps my perception of this is skewed though. – Mike Holt May 05 '15 at 05:14
  • @Jahid Yes. I use `\+` very frequently in `sed`. – Mike Holt May 05 '15 at 05:15
  • @nhahtdh and Mike Holt, If using `+` and `\+` is disputed, what about `\{1,\}`, this can do the same thing...right..? – Jahid May 05 '15 at 05:17
  • @Jahid: If you are using GNU tools, `\+` and `\?` are supported as extensions to BRE. According to the POSIX specs, only `*` and the interval form `\{m,n\}` (which covers `\{1,\}`) are official parts of BRE. – nhahtdh May 05 '15 at 05:22
  • @Jahid I wouldn't say `\+` is disputed so much as it is perhaps not as portable, and evidently not POSIX-compliant. If you're using GNU grep on some version of Linux, and you don't care about portability, you're probably safe in using `\+`. The man page states explicitly that a non-metacharacter can be turned into a metacharacter by escaping it with a backslash. So this is not some undocumented feature. – Mike Holt May 05 '15 at 05:25
  • It should be noted that I don't disagree with anything nhahtdh has said, and in fact I think his answer is a better answer. We were just approaching the problem from different angles. He framed it in a POSIX context, and I happily assumed a GNU/Linux context. – Mike Holt May 05 '15 at 05:32