egrep + quantifier not working

Question

egrep isn't matching in the following example, and from everything I've read it should. The expression is '{% +'. I'm trying to match all the {% %} brackets in my markdown files. As I understand it, the expression should match {% followed by one or more spaces, and fail to match if there is no space. The same expression matches in PowerShell, so I'm wondering what I'm missing.

Snippet to match against

{% highlight ruby %}
{% endhighlight %}

cat file.md | egrep '{% +'

In this case replace the `+` quantifier with a `*` quantifier. What is the problem? — Casimir et Hippolyte, Aug 19 '16 at 19:48
* does work, but it also allows for no space. How can I ensure a space is present? I thought that's what + would do, but it doesn't seem to work. — joshduffney, Aug 19 '16 at 19:59
Try removing `+` and see if it works. The quantifier is redundant since you need to match 1 or more. If there is 1, `'{% '` is already enough. Also, there may be a tab, not a space. Try `[[:blank:]]` instead of the literal space. — Wiktor Stribiżew, Aug 19 '16 at 19:59

score 1 · Accepted Answer · edited Jun 20 '20 at 09:12

For me, your regex works as expected. Given an input file file.md containing:

{% highlight ruby %}
{% endhighlight %}
not this line, though
nor {%this%}

When I run your command (avoiding the useless use of cat, UUoC), I get the output shown:

$ egrep '{% +' file.md
{% highlight ruby %}
{% endhighlight %}
$

You've not identified which version of egrep you are using and which platform you are using it on. I'm running Mac OS X 10.11.6 and using egrep (BSD grep) 2.5.1-FreeBSD (but I also get the same result with GNU Grep 2.25).

You should be aware, though, that { is a metacharacter to egrep, and the problem may be that it is not handling the initial { as you expect.

For example, here's a more complex egrep invocation that should only select the endhighlight line:

$ egrep '\{% {1,4}[a-z]{4,20} {1,4}%\}' file.md
{% endhighlight %}
$

I used the backslashes to escape the first and last braces. The {n,m} notation means the preceding item (here, a blank or [a-z]) must match at least n and at most m times. You can omit ,m to require exactly n matches, or write {4,} for four or more; check the manual to understand these. However, on my machine, I can also run:

$ egrep '{% {1,4}[a-z]{4,20} {1,4}%}' file.md
{% endhighlight %}
$

Presumably, because the first { doesn't start an {n,m} sequence, it is treated as an ordinary character.
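
The interval notation described above can be tried in isolation. This is a minimal sketch using grep -E, which is equivalent to egrep; the sample lines and the variable names (sample, interval_hits, open_hits) are made up for illustration and are not from the question.

```shell
# Hypothetical input: 'a' followed by one, three, and five b's.
sample='ab
abbb
abbbbb'

# b{2,4}: two to four b's. The ^ and $ anchors force the whole line to match.
interval_hits=$(printf '%s\n' "$sample" | grep -E '^ab{2,4}$')
printf '%s\n' "$interval_hits"    # prints: abbb

# b{3,}: three or more b's, with no upper bound.
open_hits=$(printf '%s\n' "$sample" | grep -E '^ab{3,}$')
printf '%s\n' "$open_hits"        # prints: abbb, then abbbbb
```

In both patterns the { directly follows an item it can quantify, so it forms a valid interval expression and there is no ambiguity about what it means.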

If you look at the POSIX specification for Extended Regular Expressions, you'll find that it says using { like that is undefined behaviour:

*+?{

The <asterisk>, <plus-sign>, <question-mark>, and <left-brace> shall be special except when used in a bracket expression (see RE Bracket Expression). Any of the following uses produce undefined results:

If these characters appear first in an ERE, or immediately following a <vertical-line>, <circumflex>, or <left-parenthesis>

If a <left-brace> is not part of a valid interval expression (see EREs Matching Multiple Characters)

So, according to POSIX, you are using a regex that produces undefined results. Therefore, whatever your egrep does with it, matching or not, is a result that POSIX deems acceptable.
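
Since the quoted text says these characters are special "except when used in a bracket expression", one way to stay within defined behaviour is to put the brace inside brackets, or to escape it with a backslash. The following is a sketch under that reading, using grep -E (equivalent to egrep) and the four sample lines from this answer; the variable names are illustrative only.

```shell
# The sample lines from file.md above, piped in so no file is needed.
input='{% highlight ruby %}
{% endhighlight %}
not this line, though
nor {%this%}'

# [{] is a bracket expression containing only a left brace, so the brace is literal.
bracket_hits=$(printf '%s\n' "$input" | grep -E '[{]% +')

# \{ escapes the brace, which also makes it literal.
escaped_hits=$(printf '%s\n' "$input" | grep -E '\{% +')

printf '%s\n' "$bracket_hits"
printf '%s\n' "$escaped_hits"
```

Both commands select the two lines with a space after {% and skip the other two, without depending on how a particular egrep treats a leading bare {.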

Clearly, you should be able to use the following and get the result you expect:

$ egrep '\{% +' file.md
{% highlight ruby %}
{% endhighlight %}
$
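
One of the comments on the question also points out that the whitespace might be a tab rather than a space. As a hedged sketch of that suggestion, [[:blank:]] (space or tab) can replace the literal space. The tabbed input here is invented for illustration, and grep -E is used as the equivalent of egrep.

```shell
# Hypothetical input: the first line has a TAB after {%, the second has no blank at all.
tabbed=$(printf '{%%\thighlight ruby %%}\n{%%endhighlight%%}\n')

# A literal space does not match the tab; -c counts matching lines.
# '|| true' keeps the non-zero exit status of a no-match grep from aborting a 'set -e' script.
space_only=$(printf '%s\n' "$tabbed" | grep -Ec '\{% +' || true)
printf '%s\n' "$space_only"    # prints: 0

# [[:blank:]]+ accepts one or more spaces or tabs.
blank_hits=$(printf '%s\n' "$tabbed" | grep -E '\{%[[:blank:]]+')
printf '%s\n' "$blank_hits"    # prints the tab-separated highlight line only
```

The second input line is rejected by both patterns, so the "at least one blank" requirement from the question still holds.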

By using the escape character "\" on { and }, I was able to get the + quantifier to work as I expected. I appreciate you pointing out that { is also a metacharacter. It makes complete sense now why it wasn't working before. I also greatly appreciate the long and detailed comment. It was extremely useful, thank you. — joshduffney, Aug 20 '16 at 00:44
