Split String by Upper Case Consisting of Both Latin and Unicode

Question

Building on the answer to "Splitting String based on letters case":

lang <- "DeutschEsperantoItalianoNederlandsNedersaksiesNorskРусский"
strsplit(lang, "(?!^)(?=[[:upper:]])", perl = TRUE)

results in

"Deutsch"      "Esperanto"    "Italiano"     "Nederlands"   "Nedersaksies" "NorskРусский"

The problem is that the last pair is not separated, apparently because the Russian text is stored as UTF-8 (the real strings will vary more, covering more or less every other language on Wikipedia). I checked online regex testers and other SO answers, but they are not much help with R. I also tried iconv and Encoding workarounds in base R (I can't seem to convert to UTF-16, and conversion to bytes doesn't help). Thoughts?

Hmm, _fine_. Seriously though, what does that do exactly (it works)? — , Jan 28 '18 at 11:24
`\p{Lu}` is a Unicode property: `L` stands for a letter and `u` for uppercase, in any alphabet — Toto, Jan 28 '18 at 11:25

score 0 · Accepted Answer · edited Jan 28 '18 at 12:27

Use the Unicode property \p{Lu}, which matches an uppercase (u) letter (L) in any alphabet. See http://www.regular-expressions.info/unicode.html

lang <- "DeutschEsperantoItalianoNederlandsNedersaksiesNorskРусский"
# The backslash is doubled because the pattern is written as an R string literal
strsplit(lang, "(?!^)(?=\\p{Lu})", perl = TRUE)
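
As a rough sketch of why this helps: with perl = TRUE, [[:upper:]] appears to cover only ASCII-range uppercase letters, which would explain why "NorskРусский" stays in one piece, whereas \p{Lu} covers uppercase letters in any script. The helper name split_on_upper below is hypothetical (it is not part of the original answer), and the example assumes a UTF-8 session and a single input string.

```r
# Hypothetical helper: split a single string before every uppercase letter,
# except an uppercase letter at the very start of the string.
split_on_upper <- function(x) {
  # (?!^)       do not split at the start of the string
  # (?=\p{Lu})  split just before any Unicode uppercase letter
  # The backslash is doubled because the pattern is an R string literal.
  strsplit(x, "(?!^)(?=\\p{Lu})", perl = TRUE)[[1]]
}

lang <- "DeutschEsperantoItalianoNederlandsNedersaksiesNorskРусский"
parts <- split_on_upper(lang)
print(parts)

# Compare the two character classes on one Cyrillic capital letter (U+0420)
print(grepl("[[:upper:]]", "\u0420", perl = TRUE))
print(grepl("\\p{Lu}", "\u0420", perl = TRUE))
```

Printing the two grepl() results side by side is a quick way to check how a given R build treats each class before relying on it for other scripts.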

edited Jan 28 '18 at 12:27

Sotos

answered Jan 28 '18 at 11:27

Toto