Does stringr's regex engine translate [a-z] into abcdefghijklmnopqrstuvwxyz?

Question

Please correct me if I'm wrong, but the pattern [a-z] should match any lowercase character from a to z inclusive, i.e. [a-z] == [abcdefghijklmnopqrstuvwxyz].

pattern <- "[a-z]"

stringr::str_detect(c("word", "12345"), pattern)

[1] TRUE FALSE

Is it the case that somewhere 'under the hood' [a-z] gets translated to [abcdefghijklmnopqrstuvwxyz], or does the engine simply check each character against the range using some numeric ordering?

I don't think it does, because doing so is impractical with large ranges. Disclaimer: I don't know R. — InSync, May 30 '23 at 13:36
This depends on the underlying regex engine, but given that characters are easily convertible to integers it would be weird not to exploit that in the implementation. — Good Night Nerd Pride, May 30 '23 at 13:51
(In)famously, in Estonian the letters T, U, V come **after** Z: https://en.wikipedia.org/wiki/Estonian_orthography — Ben Bolker, May 30 '23 at 14:20

Ben Bolker · Accepted Answer · 2023-05-30T21:31:01.367

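tl;dr: don't worry about this too much. If what you actually want is "any letter", use [:alpha:] instead, which matches alphabetic characters by Unicode property rather than by range (note that this includes uppercase and non-ASCII letters, so it is not a drop-in replacement for [a-z]).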
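As a small illustration of the difference, here is a sketch assuming stringr is installed. The test strings are made up for the example, and the bracketed form `[[:alpha:]]` is used for the character class.

```r
library(stringr)

# Hypothetical test strings: a non-ASCII lowercase letter (e with acute),
# an ASCII lowercase letter, an ASCII uppercase letter, and a digit.
x <- c("\u00e9", "z", "Z", "1")

# The range only covers code points U+0061 to U+007A.
range_hits <- str_detect(x, "[a-z]")

# The class matches anything with the Unicode Alphabetic property.
alpha_hits <- str_detect(x, "[[:alpha:]]")

print(data.frame(x, range_hits, alpha_hits))
```

The range should reject the accented and uppercase letters, while the class should accept every letter and reject only the digit.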

@benson23's answer is good, but note that stringr uses the ICU engine (via the stringi package), documented here, which is different from the implementation used by base R (which uses TRE, or PCRE if perl = TRUE): see e.g. this answer.

In the ICU documentation pointed to above, it says for ranges that

The characters to include are determined by Unicode code point ordering

So presumably, under the hood, it converts each character to its Unicode code point and tests whether that number falls within the range, rather than enumerating every character in the range.
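To make the idea concrete, here is a minimal base-R sketch of a range test by code point comparison. The helper `in_range()` is hypothetical and only mimics the logic described above; it is not how ICU is actually implemented, and it assumes each argument is a single character.

```r
# Hypothetical helper: is the single character `ch` between `lo` and `hi`
# when compared by Unicode code point?
in_range <- function(ch, lo, hi) {
  cp <- utf8ToInt(ch)  # code point of the character as an integer
  cp >= utf8ToInt(lo) && cp <= utf8ToInt(hi)
}

utf8ToInt("a")  # 97
utf8ToInt("z")  # 122

in_range("q", "a", "z")  # TRUE: 113 lies between 97 and 122
in_range("Q", "a", "z")  # FALSE: 81 is below 97
```

Two integer comparisons are enough, so nothing needs to be expanded into a list of 26 letters, and the same test works unchanged for very large ranges.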

Since Unicode code points are **independent of locale** (I'm shouting because I just figured this out myself), this means that range definition, unlike sorting/collation, will be locale-independent. (This is consistent with this answer about base-R regex range matching ...)

Sys.setlocale(category = "LC_COLLATE", locale = "et_EE")
[1] "et_EE"
stringr::str_detect("T", "[A-Z]")
[1] TRUE

For what it's worth, this extensive answer points out that most built-in regex implementations are not locale-specific (i.e., they behave like R's regex).

score 2 · Answer 2 · answered May 30 '23 at 13:50

You can take a look at the R documentation for regex (?regex):

A character class is a list of characters enclosed between ‘[’ and ‘]’ which matches any single character in that list ... A range of characters may be specified by giving the first and last characters, separated by a hyphen. (Because their interpretation is locale- and implementation-dependent, character ranges are best avoided. Some but not all implementations include both cases in ranges when doing caseless matching.) The only portable way to specify all ASCII letters is to list them all as the character class ‘[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]’. (The current implementation uses numerical order of the encoding, normally a single-byte encoding or Unicode points.)

And from this answer, I think in English locales, [a-z] gets translated into the whole alphabet. Given R uses a global string pool, I guess both syntaxes make no difference performance-wise (?)
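One way to check empirically which characters a range matches is to test every candidate character against the pattern. This is a rough base-R sketch limited to printable ASCII; the variable names are made up, and the same idea should work with `stringr::str_detect()` in place of `grepl()`.

```r
# All printable ASCII characters (code points 32 to 126), one per element.
chars <- intToUtf8(32:126, multiple = TRUE)

# Keep only those that the range pattern matches.
matched <- chars[grepl("[a-z]", chars)]

print(matched)

# Compare against R's built-in vector of the 26 lowercase letters.
identical(matched, letters)
```

This only shows what the pattern matches, not how the engine represents the range internally.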

Is there a way to see the output of what [a-z] is matching in R? If you change the locale, for example to Estonia, it should match a different range. — dcurrie27, May 30 '23 at 14:23
@dcurrie27 I think `[a-z]` is only a pattern to test whether your string has a match in it; I'm not sure how we can physically expand `[a-z]`. — benson23, May 30 '23 at 15:14
