R has several special locale-independent character classes for regular expressions.
From ?regex:
‘[[:alnum:]]’ means ‘[0-9A-Za-z]’, except the latter depends upon the locale and the character encoding, whereas the former is independent of locale and character set.
I'd like to know when locale-specific problems can occur.
I tried two examples based on the information in the ?Comparison help page, which describes how strings are sorted:
in Estonian ‘Z’ comes between ‘S’ and ‘T’
and
in Danish ‘aa’ sorts as a single letter, after ‘z’
In the first example, I would expect T, U, V, W, X and Y not to match. In the second example, I would expect aa not to match.
Sys.setlocale("LC_ALL", "Estonian")
grepl("[A-Z]", LETTERS)
Sys.setlocale("LC_ALL", "Danish")
grepl("[a-z]", "aa")
Since all values return TRUE, it seems that locale is not a problem here.
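One caveat with the test above: Sys.setlocale() returns an empty string (with a warning) when the operating system cannot honor the request, and in that case the locale is left unchanged, so an all-TRUE result says nothing about Estonian or Danish. Below is a minimal sketch of a check that guards against this. The helper name try_in_locale is hypothetical, and the locale names in the usage loop are assumptions that vary by platform ("Estonian" is a Windows-style name, "et_EE.UTF-8" a typical Unix-style one).

```r
# Hypothetical helper: run `check()` under `locale`, then restore the old
# settings. Returns NULL if the locale could not actually be set.
try_in_locale <- function(locale, check) {
  old_ctype   <- Sys.getlocale("LC_CTYPE")
  old_collate <- Sys.getlocale("LC_COLLATE")
  on.exit({
    Sys.setlocale("LC_CTYPE", old_ctype)
    Sys.setlocale("LC_COLLATE", old_collate)
  }, add = TRUE)

  # Sys.setlocale() returns "" when the request cannot be honored.
  set <- suppressWarnings(c(
    Sys.setlocale("LC_CTYPE", locale),
    Sys.setlocale("LC_COLLATE", locale)
  ))
  if (length(set) < 2 || !all(nzchar(set))) {
    return(NULL)
  }
  check()
}

# Usage: the locale names here are assumptions; adjust for your platform.
for (loc in c("Estonian", "et_EE.UTF-8")) {
  res <- try_in_locale(loc, function() grepl("[A-Z]", LETTERS))
  if (is.null(res)) {
    message(loc, ": locale not available, test skipped")
  } else {
    message(loc, ": ", sum(res), " of ", length(res), " letters matched")
  }
}
```

A NULL result means the test never ran under the intended locale, which is different from the range matching everything.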
Can you find an example where locale causes traditional regular expression classes like [a-z] to fail?
UPDATE: I have a partial answer: accented Roman characters behave differently with [a-zA-Z] vs. [[:alpha:]]. I'm still interested to know whether there are more examples of differences, whether locale or encoding affects matching of non-Roman characters, and indeed how to match non-Roman characters at all.
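To make the comparison in the update concrete, here is a small sketch that tests the same characters against an explicit range, the POSIX class, and a Unicode property class. The test strings are my own choices, and using \p{L} with perl = TRUE is one possible way to match letters from any script; it assumes the PCRE library R was built against has Unicode property support. I have deliberately not listed expected results for the [[:alpha:]] column, since those can depend on the locale and platform.

```r
# Test characters written as escapes so the script does not depend on the
# encoding of the source file.
x <- c(ascii = "a", accented = "\u00e9", cyrillic = "\u0436", greek = "\u03bb")

res <- data.frame(
  range   = grepl("^[a-zA-Z]$", x),              # explicit ASCII range
  posix   = grepl("^[[:alpha:]]$", x),           # POSIX class (may vary by locale)
  unicode = grepl("^\\p{L}$", x, perl = TRUE),   # Unicode "letter" property
  row.names = names(x)
)
print(res)
```

Running this under different locales (and with different values of Sys.getlocale("LC_CTYPE")) shows which columns, if any, change.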