Positive lookbehind on non-ASCII characters in R

Question

I have an R function which tries to capitalise the first letter of every "word":

proper = function(x){
  gsub("(?<=\\b)([[:alpha:]])", "\\U\\1", x, perl = TRUE)
}

This works pretty well, but when a word contains a letter with a macron, as in Māori, I get improper capitalisation, e.g.

> proper("Māori")
[1] "MāOri"

Clearly the regex engine thinks there is a word boundary after the macron ā. I'm not sure why.

Does **[this](https://stackoverflow.com/questions/18509527/first-letter-to-upper-case)** post help? `str_to_title` from `stringr` doesn't capitalize the o in Māori either. — tyluRp, Dec 02 '17 at 07:07
Is there a unicode flag (like the one used in https://regex101.com/r/unVXlI/1)? BTW, for something like a word boundary, it may not be necessary to use a lookbehind. — wibeasley, Dec 02 '17 at 07:22

score 3 · Accepted Answer · answered Dec 02 '17 at 11:40


Since you are using the PCRE regex engine (enabled with perl = TRUE), you need to put the (*UCP) flag at the start of the pattern so that the shorthand classes and word boundaries are Unicode-aware:

proper = function(x){
  gsub("(*UCP)\\b([[:alpha:]])", "\\U\\1", x, perl = TRUE)
}
proper("Māori")
## [1] "Māori"

See the R demo.

Note that \b is already a zero-width assertion and does not have to be placed into a positive lookbehind, i.e. (?<=\b) = \b.
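To illustrate the note above, here is a small sketch that runs both forms of the pattern on the same input and compares the results. The variable names and the test string are only for illustration, and the macron is written as the escape \u0101 so the example does not depend on the file's encoding.

```r
# "M\u0101ori" is "Māori"; the escape avoids source-encoding issues.
x <- "te reo M\u0101ori"

# Pattern with the redundant lookbehind around \b
with_lookbehind <- gsub("(*UCP)(?<=\\b)([[:alpha:]])", "\\U\\1", x, perl = TRUE)

# Pattern with a plain \b
without_lookbehind <- gsub("(*UCP)\\b([[:alpha:]])", "\\U\\1", x, perl = TRUE)

# Both calls should produce the same string
print(identical(with_lookbehind, without_lookbehind))
```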

Wiktor Stribiżew

Also, `"(*UCP)\\b([[:alpha:]])"` can be replaced with a shorter `"(*UCP)\\b(\\p{L})"` – Wiktor Stribiżew Dec 02 '17 at 17:33

Excellent. Thanks! – James Curran Dec 02 '17 at 21:18
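For reference, the shorter pattern from the comment above could be wrapped like this. It is only a sketch: `proper_pl` is a made-up name, not something from the original post, and the macron is again written as \u0101.

```r
# Hypothetical variant of proper() using \p{L} (any Unicode letter)
# in place of [[:alpha:]]; (*UCP) is still needed so that \b is Unicode-aware.
proper_pl <- function(x) {
  gsub("(*UCP)\\b(\\p{L})", "\\U\\1", x, perl = TRUE)
}

res_pl <- proper_pl("te reo M\u0101ori")
print(res_pl)
```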

score 0 · Answer 2 · answered Dec 02 '17 at 09:53

\b matches at the boundary between a word character and a non-word character. By default, only [a-zA-Z0-9_] count as word characters, so multibyte characters such as ā are treated as non-word characters unless a Unicode modifier is set to change the engine's behaviour.
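A quick way to see the default behaviour is to test whether a string consists entirely of \w characters. This is just an illustrative check with made-up variable names; the macron is written as \u0101 to avoid encoding problems.

```r
ascii_word  <- "Maori"
macron_word <- "M\u0101ori"   # "Māori"

# TRUE only if every character in the string matches \w
all_word_ascii  <- grepl("^\\w+$", ascii_word,  perl = TRUE)
all_word_macron <- grepl("^\\w+$", macron_word, perl = TRUE)

print(all_word_ascii)    # TRUE
print(all_word_macron)   # FALSE: the macron letter is not a \w character by default
```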

Unfortunately, gsub in R doesn't seem to have an argument for this flag, or at least I couldn't find anything about it in the documentation.

A workaround would be:

(?<!\\S)([[:alpha:]])

which, on the other hand, obviously fails on āmori.
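Here is a rough sketch of the workaround in use. `proper_ws` is a hypothetical name for illustration, and \u0101 stands for ā. Note that this pattern only treats whitespace as a separator, so a letter after a hyphen or other punctuation is left alone, unlike with \b.

```r
# Hypothetical helper: capitalise a letter only when it is not
# preceded by a non-whitespace character.
proper_ws <- function(x) {
  gsub("(?<!\\S)([[:alpha:]])", "\\U\\1", x, perl = TRUE)
}

res_ws <- proper_ws("te reo M\u0101ori")
print(res_ws)                 # the "o" after the macron is left alone

# The failing case mentioned above: a word starting with the macron letter
print(proper_ws("\u0101mori"))
```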
