Building on the answer to "Splitting String based on letters case":

lang <- "DeutschEsperantoItalianoNederlandsNedersaksiesNorskРусский"
strsplit(lang, "(?!^)(?=[[:upper:]])", perl = TRUE)

results in

"Deutsch"      "Esperanto"    "Italiano"     "Nederlands"   "Nedersaksies" "NorskРусский"

The problem is that the last pair is not separated, as the Russian text is converted to UTF-8 (there will be more variation in the strings, e.g. more or less all the other languages on Wikipedia). I checked online regex testers and other SO answers, but they are not much help with R. I also tried iconv and Encoding workarounds in base R (I can't seem to convert to UTF-16, and conversion to bytes doesn't help). Thoughts?

Sotos

1 Answer

Use the Unicode property \p{Lu}, which matches an uppercase (u) letter (L) in any alphabet. See http://www.regular-expressions.info/unicode.html

lang <- "DeutschEsperantoItalianoNederlandsNedersaksiesNorskРусский"
# The backslash must be doubled inside an R string literal
strsplit(lang, "(?!^)(?=\\p{Lu})", perl = TRUE)
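
If this is needed in several places, the pattern could be wrapped in a small helper. The following is only a sketch: the name split_on_upper is hypothetical (it is not from the question or from base R), and it assumes the input is UTF-8 so that PCRE applies the Unicode property to non-ASCII letters.

```r
# Hypothetical helper (name is an assumption, not part of base R):
# split each string before every uppercase letter of any alphabet.
split_on_upper <- function(x) {
  # (?!^)      : do not split at the very start of the (remaining) string
  # (?=\p{Lu}) : zero-width lookahead for any Unicode uppercase letter
  # The backslash is doubled because this is an R string literal.
  strsplit(x, "(?!^)(?=\\p{Lu})", perl = TRUE)
}

lang <- "DeutschEsperantoItalianoNederlandsNedersaksiesNorskРусский"
split_on_upper(lang)[[1]]
# On a UTF-8 locale this should give seven pieces:
# "Deutsch" "Esperanto" "Italiano" "Nederlands" "Nedersaksies" "Norsk" "Русский"
```

Like strsplit itself, the helper returns a list with one character vector per input string, so it can also be applied to a whole vector of concatenated names.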
Toto