I have an R function which tries to capitalise the first letter of every "word":

proper = function(x){
  gsub("(?<=\\b)([[:alpha:]])", "\\U\\1", x, perl = TRUE)
}

This works pretty well, but when I have a word with a Māori macron in it like Māori I get improper capitalisation, e.g.

> proper("Māori")
[1] "MāOri"

Clearly the RE engine thinks the macron ā is a word boundary. Not sure why.

James Curran
  • Does **[this](https://stackoverflow.com/questions/18509527/first-letter-to-upper-case)** post help? `str_to_title` from `stringr` doesn't capitalize the o in Māori either. – tyluRp Dec 02 '17 at 07:07
  • Is there a unicode flag (like the one used in https://regex101.com/r/unVXlI/1)? BTW, for something like a word boundary, it may not be necessary to use a lookbehind. – wibeasley Dec 02 '17 at 07:22

2 Answers

Since you are using the PCRE regex engine (enabled with perl=TRUE), you need to put the (*UCP) verb at the start of the pattern so that shorthand character classes and word boundaries are Unicode-aware and match the correct characters and positions in Unicode text:

proper = function(x){
  gsub("(*UCP)\\b([[:alpha:]])", "\\U\\1", x, perl = TRUE)
}
proper("Māori")
## [1] "Māori"

See the R demo.

Note that \b is already a zero-width assertion and does not have to be placed into a positive lookbehind, i.e. (?<=\b) = \b.
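
As a quick check of both points, here is a minimal sketch that runs the (*UCP) version on a multi-word string and compares it with a lookbehind variant. The name proper_lookbehind and the test strings are made up for illustration, and ā is written as \u0101 so the result does not depend on the file encoding:

```r
# Version from the answer: (*UCP) makes \b and [[:alpha:]] Unicode-aware.
proper <- function(x) {
  gsub("(*UCP)\\b([[:alpha:]])", "\\U\\1", x, perl = TRUE)
}

# Hypothetical variant that keeps the redundant lookbehind around \b.
proper_lookbehind <- function(x) {
  gsub("(*UCP)(?<=\\b)([[:alpha:]])", "\\U\\1", x, perl = TRUE)
}

# "\u0101" is the letter a with a macron.
words <- c("M\u0101ori", "te reo m\u0101ori")

res_plain      <- proper(words)
res_lookbehind <- proper_lookbehind(words)

print(res_plain)
# Both patterns should give the same result.
print(identical(res_plain, res_lookbehind))
```

Only the first letter of each space-separated word should come back capitalised, with the letter after the macron left alone.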

Wiktor Stribiżew

By default, \b marks a boundary between a character in [a-zA-Z0-9_] and anything else. "Anything else" includes multibyte characters such as ā, unless a Unicode modifier is set to change the engine's behaviour.
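
A small sketch of that default, assuming a UTF-8 input string: \w follows the same word-character definition as \b, so replacing every \w match shows which characters the engine treats as part of a word. The variable names are illustrative only, and the second call uses the PCRE (*UCP) verb for comparison:

```r
# "\u0101" is the letter a with a macron, escaped to avoid encoding issues.
x <- "M\u0101ori"

# Default PCRE behaviour: only ASCII letters, digits and _ count as \w,
# so the macron character is left untouched.
ascii_only <- gsub("\\w", "x", x, perl = TRUE)

# With (*UCP), \w uses Unicode properties and also matches the macron character.
unicode_aware <- gsub("(*UCP)\\w", "x", x, perl = TRUE)

print(ascii_only)
print(unicode_aware)
```

Because ā is not a word character by default, the engine sees a boundary on either side of it, which is why the o in Māori gets capitalised.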

Unfortunately, gsub in R doesn't seem to have this flag, or at least I couldn't find anything about it in the documentation.

A workaround would be:

(?<!\\S)([[:alpha:]])

which, on the other hand, obviously fails on āmori.
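
To make the trade-off concrete, here is a minimal sketch that plugs the workaround pattern into the question's function. The name proper_ws and the sample strings are assumptions for illustration, and ā is written as \u0101:

```r
# Workaround: capitalise a letter only when it is not preceded by a
# non-whitespace character, instead of relying on \b.
proper_ws <- function(x) {
  gsub("(?<!\\S)([[:alpha:]])", "\\U\\1", x, perl = TRUE)
}

ok     <- proper_ws("te reo m\u0101ori")  # words starting with ASCII letters
failed <- proper_ws("\u0101mori")         # word starting with the macron letter

print(ok)
print(failed)
```

The second call returns its input unchanged: without Unicode support, [[:alpha:]] does not match ā, and the following m is preceded by a non-whitespace character, so nothing is capitalised.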

revo