
I'm reading Jan Goyvaerts' "Regular Expressions: The Complete Tutorial and Reference" to brush up on my regex.

In the second chapter, Jan has a section on "special characters:"

Special Characters

Because we want to do more than simply search for literal pieces of text, we need to reserve certain characters for special use. In the regex flavors discussed in this tutorial, there are 12 characters with special meanings: the backslash \, the caret ^, the dollar sign $, the period or dot ., the vertical bar or pipe symbol |, the question mark ?, the asterisk or star *, the plus sign +, the opening parenthesis (, the closing parenthesis ), the opening square bracket [, and the opening curly brace {. These special characters are often called “metacharacters”. Most of them are errors when used alone.

(emphasis mine)

I understand why only the open square bracket and open curly brace are special: a close brace or bracket is clearly a literal if there's no preceding open. However, why does Jan specify that the close parenthesis is a special character if the other two closing characters aren't?

scohe001
  • Close parenthesis is a special character. All of them must be escaped (if used outside of a character class) if they should be parsed as literal chars. – Wiktor Stribiżew Sep 20 '18 at 19:05
  • @WiktorStribiżew I understand that close parenthesis is a special character. But **why** is it that close parenthesis is special if close brace and close bracket *aren't*? The duplicate doesn't answer this either. – scohe001 Sep 20 '18 at 19:07
  • @WiktorStribiżew I don't see how either one of those two duplicates answers the "**why**" question. – Aran-Fey Sep 20 '18 at 19:07
  • @WiktorStribiżew: I am afraid that you keep misunderstanding the question. How do you explain the difference between `)` and `]`, `}` ? –  Sep 20 '18 at 19:10
  • @WiktorStribiżew But how is it different from `]` and `}`? Why is `)` a special character if `]` and `}` aren't? *Why* is it considered a special character? What makes it special? – Aran-Fey Sep 20 '18 at 19:10
  • Close and open parentheses are not always special, in BRE POSIX flavor, they are not. Square close bracket sometimes is special, as in JS character class. – Wiktor Stribiżew Sep 20 '18 at 19:18
  • @WiktorStribiżew But **why**? – Aran-Fey Sep 20 '18 at 19:19
  • I think @Yves gives a convincing argument. However, to answer the question literally, you would need to ask the book author. – Bergi Sep 20 '18 at 19:39
  • @Bergi [this answer](https://stackoverflow.com/a/400316/2602718) also includes `)` but conspicuously leaves off `]` and `}` for PCRE, so I'm assuming that more people than just the author agree with this. – scohe001 Sep 20 '18 at 19:44
  • Depends on the regex engine really. Most have an implicit stack counter for `(…)` groups, where even the first closing `)` would lead to an invalid state. Charclasses `[]` and quantifier `{}` curlies are rather localized syntax constructs however, no need for counting. – mario Sep 20 '18 at 19:48
  • @scohe001 That answer is specifically about what needs to be escaped, not what counts as a special character. Escaping is all about context - you escape a value to insert it where in a regex? We usually assume "*anywhere but not in a character class or repetition modifier*", i.e. so that the escaped value will cause a literal match. This might be inserted into a group however, so that `)` must be escaped to not prematurely close the group. – Bergi Sep 20 '18 at 19:51
  • "Why" questions are often outside StackOverflow's scope, beyond "that's what the standard states" (substitute "specification" &c as appropriate). See [What is the rationale for closing "why" questions on a language design?](https://meta.stackexchange.com/questions/170394/what-is-the-rationale-for-closing-why-questions-on-a-language-design), quoting topic rules: "You should only ask practical, answerable questions based on actual problems that you face." -- once "how" has been communicated, you no longer face an *actual problem*, but merely have a point of curiosity. – Charles Duffy Sep 20 '18 at 20:48
  • I'd be happy to explain why I wrote what I wrote. But I can't do that while the question is on hold. I'm not going to answer in a comment. – Jan Goyvaerts Sep 22 '18 at 00:54
  • @JanGoyvaerts question is now reopened. – revo Jan 17 '19 at 10:51

4 Answers


Short answer

The regex flavors in my book do not require } and ] to be escaped (except for ] in character classes in JavaScript). So I don't escape them, because I like to have as few backslashes in my regexes as possible. You can escape them if you find your regexes clearer that way.

Full answer

First of all, anyone learning about regular expressions needs to understand the importance of the qualifier "In the regex flavors discussed in this tutorial..." You cannot discuss regular expressions without stating which regex flavor(s) you're talking about.

What I wrote is true for the flavors my book (2006 edition) discusses. In those flavors, ) is treated as a token that closes a group. It is a syntax error if used without a corresponding (. So ) has a special meaning when used all on its own.

} does not have a special meaning when used all on its own. You never need to escape it with these flavors. If you wanted to match something like {7} or {7,42} literally, you only need to escape the opening {. If you want to argue that } is special because it sometimes has a special meaning, then you would have to say the same about , which becomes special in the same situation.

] does not have a special meaning outside character classes in these regex flavors. You never need to escape it outside character classes. The paragraph you quoted does not talk about special characters inside character classes. That's a totally different list (\, ], ^, and -) discussed in a later chapter.
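The three cases above can be checked quickly in any one flavor. The following is a minimal sketch that assumes Python's `re` module as the test flavor; other engines may behave differently, and the helper name `compiles` is only illustrative.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# A lone closing parenthesis is a syntax error in this flavor...
lone_paren_ok = compiles(")")
# ...while a lone closing brace or bracket is an ordinary literal.
lone_brace_ok = compiles("}")
lone_bracket_ok = compiles("]")

# To match {7,42} literally, escaping only the opening brace is enough.
literal_quantifier = re.fullmatch(r"\{7,42}", "{7,42}") is not None

print(lone_paren_ok, lone_brace_ok, lone_bracket_ok, literal_quantifier)
```

Only the pattern `)` is rejected here; the unescaped `}` and `]` are accepted as literals.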

Now as to why: most regular expressions have plenty of backslashes already. My preferred style is to escape as few characters as needed. So I never escape }. I escape ] in character classes when using JavaScript because that's the only way. But with other flavors I place ] at the start of the character class or after the negating caret so I don't need to escape it. My teaching materials teach this style. When my products RegexBuddy or RegexMagic convert or generate regular expressions, they also use as few backslashes as needed.
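As an illustration of placing ] first in a character class, here is a small sketch that assumes Python's `re` module, which accepts this form. JavaScript does not, as noted above, so the escaped variant is shown for comparison.

```python
import re

# ] placed first in the class is a literal member, so no backslash is needed.
bracket_or_a = re.compile(r"[]a]")
# The same holds directly after the negating caret.
neither_bracket_nor_a = re.compile(r"[^]a]")
# The escaped form is equivalent and is the only option in JavaScript.
escaped_form = re.compile(r"[\]a]")

sample = "a]b"
found_plain = bracket_or_a.findall(sample)
found_negated = neither_bracket_nor_a.findall(sample)
found_escaped = escaped_form.findall(sample)

print(found_plain, found_negated, found_escaped)
```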

I often see people new to regular expressions needlessly escape characters like ", ', or / because they need to be escaped when the regular expression is quoted as a source code literal in certain programming languages. But the regular expression itself does not require these to be escaped.

I even see people escape characters like < or >. This is a bad habit because in some regex flavors \< and \> are word boundaries. This includes recent versions of PCRE (but not the PCRE that was current in 2006).

But, if you find it confusing to see unescaped } and ] used as literals, you are free to escape them in your regexes. Except for < and >, all the flavors discussed in my book allow you to escape any punctuation character to match that character literally, even if the character on its own would be a literal already.

So somebody saying that } and ] are special characters in regular expressions is not wrong if "special characters" means "characters that have a special meaning either on their own or when used in combination with other characters". But that list would also include , (quantifier), : (non-capturing group), - (mode modifier), ! (negative lookaround), < (lookbehind), and - (character class range).

But if "special characters" means "characters that have a special meaning on their own", then } and ] are not included in the list for the flavors my book covers.

Jan Goyvaerts

The following paragraphs give an answer. I'm citing from Jan's website, not from the book, though:

If you forget to escape a special character where its use is not allowed, such as in +1, then you will get an error message.

Most regular expression flavors treat the brace { as a literal character, unless it is part of a repetition operator like a{1,3}. So you generally do not need to escape it with a backslash, though you can do so if you want. But there are a few exceptions. Java requires literal opening braces to be escaped. Boost and std::regex require all literal braces to be escaped.

] is a literal outside character classes. Different rules apply inside character classes. Those are discussed in the topic about character classes. Again, there are exceptions. std::regex and Ruby require closing square brackets to be escaped even outside character classes.

It seems like he uses "needs to be escaped" as his definition for "special character", and unlike ), the ] and } characters need not be escaped in most flavours.
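One way to see this "needs to be escaped" reading in practice is to try each of the twelve characters on its own. The sketch below assumes Python's `re` module as the flavor under test; the list and helper names are illustrative, and other engines will give a different split.

```python
import re

# The twelve metacharacters listed in the quoted paragraph.
METACHARACTERS = ["\\", "^", "$", ".", "|", "?", "*", "+", "(", ")", "[", "{"]

def is_error_alone(pattern: str) -> bool:
    """Return True if the pattern is rejected by Python's re module."""
    try:
        re.compile(pattern)
    except re.error:
        return True
    return False

errors_alone = [ch for ch in METACHARACTERS if is_error_alone(ch)]
accepted_alone = [ch for ch in METACHARACTERS if not is_error_alone(ch)]

print("errors when used alone:", errors_alone)
print("accepted when used alone:", accepted_alone)
print("+1 is an error:", is_error_alone("+1"))
```

In this flavor the closing parenthesis is among the characters rejected on their own, while a lone { is accepted as a literal.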

That said, you wouldn't be wrong calling them special characters as well. It's definitely a best practice to always escape them, and in no flavour do \] and \} mean anything other than a literal ] or }.
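A short sketch of that equivalence, assuming Python's `re` module: the escaped and unescaped forms match the same text, and the standard library's own `re.escape` helper takes the defensive route and escapes all three closing characters.

```python
import re

# Outside a character class, the escaped and unescaped forms are equivalent here.
plain = re.compile(r"]}")
backslashed = re.compile(r"\]\}")

plain_matches = plain.fullmatch("]}") is not None
backslashed_matches = backslashed.fullmatch("]}") is not None

# re.escape escapes ), ] and } alike when building a literal pattern.
escaped_text = re.escape("a)b]c}")

print(plain_matches, backslashed_matches, escaped_text)
```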

On the other hand, they have their special meaning only inside a specific (parsing) context, namely when they follow [ and { respectively. There are similar cases: the characters :=><!#'& all have a non-literal meaning inside a specific context, and we wouldn't normally call these "special characters" either.

And while we could say the same about ), almost no flavour allows for it to occur on its own outside of groups, because pairs of parentheses always need to match. Its only usage is in the special context, and therefore ) is considered a special character.

Bergi
  • *no flavour allows for it to occur on its own* in fact POSIX regular expressions do. – revo Sep 20 '18 at 20:53
  • The 2006 edition of my regex tutorial did not cover the POSIX standard. The present edition on the website does. In the POSIX standard, `)` is not a special character on its own. Very few modern regex flavors behave this way. The ARE engine in Tcl and postgresql and some of the grammars in std::regex and boost::regex are notable exceptions that do. – Jan Goyvaerts Jan 18 '19 at 06:03

Everywhere in a regular expression, regardless of the engine and its standards, a parenthesis must be escaped to stand for a literal character, even the closing parenthesis. However, this doesn't apply to POSIX regular expressions:

) The <right-parenthesis> shall be special when matched with a preceding <left-parenthesis>, both outside a bracket expression.

But the interesting part is that POSIX has a separate definition of when a right-parenthesis should be treated as a special character. It doesn't have one for } or ].

Why don't other engines follow this rule?

Call it implementation peculiarities or historical reasons that have something to do with Perl, as commented in the PCRE source code:

/* It appears that Perl allows any characters whatsoever, other than
a closing parenthesis, to appear in arguments, so we no longer insist on
letters, digits, and underscores. */

It seems that, with all the special clusters in more advanced engines, treating a closing parenthesis as a special character costs much less than implementing the POSIX standard.

revo

From experiments, it appears that unlike ), the characters ] and } are only interpreted as delimiters when the corresponding opening [ or { has been met.


Though IMO the same rule could apply to ), that's the way it is.

This might be due to the way the parser was written: parentheses can be nested, so their balancing needs to be checked, whereas brackets and curly braces are just flagged. (For instance, [[] is a valid class definition. [[]] is also a valid pattern, but it is understood as [\[]\].)
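To make that guess concrete, here is a toy scanner. It is purely hypothetical and not taken from any real engine: the function name `scan` and the role labels are invented for illustration. It keeps a depth counter for groups but only a flag for character classes, which is enough to reproduce the behaviour described above.

```python
def scan(pattern: str) -> list[tuple[str, str]]:
    """Label each token of a toy pattern (hypothetical sketch, not a real engine)."""
    roles = []
    depth = 0          # groups can nest, so they need a counter
    in_class = False   # a character class cannot nest, so a flag is enough
    class_len = 0      # members seen so far in the current class
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # A backslash makes the next character a literal.
            roles.append((pattern[i:i + 2], "literal"))
            if in_class:
                class_len += 1
            i += 2
            continue
        if in_class:
            if ch == "]" and class_len > 0:
                roles.append((ch, "class-close"))
                in_class = False
            else:
                roles.append((ch, "class-member"))
                class_len += 1
        elif ch == "[":
            roles.append((ch, "class-open"))
            in_class, class_len = True, 0
        elif ch == "(":
            depth += 1
            roles.append((ch, "group-open"))
        elif ch == ")":
            if depth == 0:
                # No open group to close: the counter would go negative.
                raise ValueError(f"unbalanced ) at position {i}")
            depth -= 1
            roles.append((ch, "group-close"))
        else:
            # Outside a class, ] and } have no opener to pair with: literals.
            roles.append((ch, "literal"))
        i += 1
    if in_class:
        raise ValueError("unterminated character class")
    if depth:
        raise ValueError("missing )")
    return roles

for token, role in scan("[[]]"):
    print(token, role)
```

With this design, a stray ) has nowhere to go except an error, while a stray ] or } falls through to the literal branch without any extra bookkeeping.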