
I was told that [[a-z][0-9]] is equivalent to [a-z0-9], but I tried a few examples:

grepl("[[a-z][0-9]]", "d") returns FALSE.

Similarly, grepl("[[:alpha:][0-9]]", "d") returns FALSE, while things like grepl("[[:upper:][:lower:]]", "d") work fine.

May I ask if this indicates that double square brackets could only be used for combining things of the form "[:...:]" but not for things like [A-z] or [0-9]?

If so, why would R stop us from doing so? And what do grepl("[[a-z][0-9]]", "d") and grepl("[[a-z]]", "d") actually mean?

Furthermore, I know that we need to use double square brackets for things like "[[:digit:]]", because "[:digit:]" would instead search for ":", "d", "i", "g" or "t" (from this question). But how exactly is the structure of "[[:digit:]]" interpreted? (Just a guess: does R interpret it as the trivial union of [:digit:] with itself, so that it's just a 'readable' [:digit:] for R?)

J-A-S

1 Answer


You should be careful with square brackets inside regular expressions. Now, I will assume you are using the default TRE regex library that is used with the base R regex functions (when no perl=TRUE is passed).

In this case, you should differentiate between

  • [: and :] that mark the start and end of a POSIX character class, e.g. [:alpha:]
  • [ and ] that mark the start and end of a bracket expression
  • an unescaped ] that is not preceded by a matching unescaped [, which is treated as a literal ] char.
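The three roles can be checked directly in R. This is only a small sketch, assuming the default TRE engine (no perl=TRUE); the variable names are made up for illustration:

```r
# Assumes base R's default TRE engine (perl = TRUE is not passed).

# [:digit:] is a POSIX character class placed inside a bracket expression.
posix_class <- grepl("[[:digit:]]", "5")

# A POSIX class and a range can share one bracket expression.
mixed_class <- grepl("[[:alpha:]0-9]", "d")

# A ] with no matching open [ is just a literal ] char.
literal_close <- grepl("a]", "a]")

print(c(posix_class, mixed_class, literal_close))  # TRUE TRUE TRUE
```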

The [[a-z][0-9]] regex is not equal to [a-z0-9].

  • [[a-z][0-9]] matches strings like [1], a1] and means:
    • [[a-z] - a bracket expression matching a [ char or any lowercase ASCII letter
    • [0-9] - a digit
    • ] - a ] char.

The [a-z0-9] bracket expression just matches a lowercase ASCII letter or digit.
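To see the difference, the two patterns can be run against the same inputs. A minimal sketch, again assuming the default TRE engine; the input strings and variable names are chosen only for illustration:

```r
# Assumes base R's default TRE engine (perl = TRUE is not passed).
inputs <- c("d", "7", "a1]", "[1]")

# "[[a-z]" + "[0-9]" + "]": needs three consecutive chars.
nested_hits <- grepl("[[a-z][0-9]]", inputs)
print(nested_hits)  # FALSE FALSE TRUE TRUE

# One bracket expression: any single lowercase letter or digit is enough.
flat_hits <- grepl("[a-z0-9]", inputs)
print(flat_hits)    # TRUE TRUE TRUE TRUE

# With a POSIX class the split is different: "[[:alpha:][0-9]" is one
# bracket expression (a letter, a [ or a digit), followed by a literal ].
posix_hits <- grepl("[[:alpha:][0-9]]", c("d", "d]", "5]"))
print(posix_hits)   # FALSE TRUE TRUE
```

This also accounts for grepl("[[:alpha:][0-9]]", "d") returning FALSE in the question: that pattern needs a trailing ] char in the input.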

There is no such construct in regex as double square brackets. Inside a bracket expression, [ can be used anywhere to match a [ char (unless it is followed by : and so starts a POSIX character class). ] only matches a literal ] when it is placed at the start of a bracket expression:

  • [a-z[] matches a single char, a lowercase ASCII letter or [
  • [][a-z] matches a single char, a lowercase ASCII letter, [ or ]
  • [[a-z]] matches a lowercase ASCII letter or [ and then a ] char (so, 2 chars in total)
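The three patterns above can be verified with anchored versions, so that the whole input has to match. This is a sketch under the same TRE assumption; the ^ and $ anchors and the variable names are additions for the test only:

```r
# Assumes base R's default TRE engine (perl = TRUE is not passed).
# ^ and $ force the pattern to cover the whole input string.

# [a-z[] : one char, a lowercase letter or [
open_last <- grepl("^[a-z[]$", c("q", "[", "]"))
print(open_last)      # TRUE TRUE FALSE

# [][a-z] : one char, a lowercase letter, [ or ]
close_first <- grepl("^[][a-z]$", c("q", "[", "]"))
print(close_first)    # TRUE TRUE TRUE

# [[a-z]] : a lowercase letter or [, followed by a literal ] (two chars)
double_bracket <- grepl("^[[a-z]]$", c("q", "q]", "[]"))
print(double_bracket) # FALSE TRUE TRUE
```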

More things to consider

Wiktor Stribiżew
  • Thank you for your helpful clarification! I understand most of it now. May I further ask how R knows whether a `[` inside a character class is a **literal char** or the **start of a POSIX character class**? (As you said, "*Inside a character class, [ can be used anywhere to match a [ char*".) Is that because R detects the `:` that follows `[` in the case of a POSIX character class? – J-A-S Oct 24 '20 at 20:20
  • @J-A-S Regular expressions are parsed from left to right. Once the regex engine steps inside a bracket expression, it knows exactly what can appear there. So, if `[` is found there, the next char is checked, and since it is `:`, the following chars are read until `:]`; the name found is then checked against the valid POSIX character class names, and parsing goes on. So, be careful: `[[:]` will fail with an `Unknown character class name` error. Make sure there is no "dangling" `[:` in the bracket expression; `[:[]` will work. In a smarter PCRE regex, you can use `[[:]`. – Wiktor Stribiżew Oct 24 '20 at 21:53
  • Thanks :) That makes perfect sense. Btw, thank you for the two extra pieces of information you put in the last section – J-A-S Oct 25 '20 at 03:55
  • Hi, sorry to bother you again, may I ask why `'[]-/]'` and `'[]-//]'` are invalid character ranges while `'[]/-]'` and `'[]//-]'` are not? – J-A-S Oct 26 '20 at 14:45
  • @J-A-S In general, see [this thread](https://stackoverflow.com/questions/3697202/including-a-hyphen-in-a-regex-character-bracket) and [this answer](https://stackoverflow.com/a/4068725/3832970), just mind that a `-` creates a range between two chars in the Unicode table, and these two chars must comply with the rule: the first one must have a lower ID in the table and the second (upper range limit) must have a higher ID. If the IDs go in the opposite order you get invalid range error. If you need to use a literal `-`, put it at the end of the bracket expression. – Wiktor Stribiżew Oct 26 '20 at 14:50
  • Thank you for your continued and effective help! I'm probably trying to be too general here, but are there situations where we want to match so many characters that we run out of proper places? (For instance, if we want to match a literal `]` and a literal `-`, we would have occupied the beginning and the end of a bracket expression.) – J-A-S Oct 26 '20 at 15:07
  • @J-A-S Correct, and this is enough. Only `]` and `-` are that specific. There is no need to escape any other chars. Just mind that a literal `^` must not be the first char in a bracket expression, else it creates a negated bracket expression. – Wiktor Stribiżew Oct 26 '20 at 15:07
  • save the day haha – J-A-S Oct 26 '20 at 15:09
  • Hi, umm, there is another question [here](https://stackoverflow.com/questions/64549809/how-do-greedy-lazy-non-greedy-possessive-quantifiers-work-internally) which is closely related to regex. You are more than welcome to answer it (if it does not take too long for you) :) – J-A-S Oct 27 '20 at 07:41