Building on the answer to "Splitting String based on letters case":
lang <- "DeutschEsperantoItalianoNederlandsNedersaksiesNorskРусский"
strsplit(lang, "(?!^)(?=[[:upper:]])", perl = TRUE)
results in
"Deutsch" "Esperanto" "Italiano" "Nederlands" "Nedersaksies" "NorskРусский"
The problem is that the last pair is not separated, seemingly because the Russian text is converted to UTF-8. (The real strings will vary more, covering more or less every other language on Wikipedia.) I checked online regex testers and other SO answers, but they are not much help with R. I also tried iconv and Encoding workarounds in base R (I can't seem to convert to UTF-16, and converting to bytes doesn't help). Thoughts?
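For reference, here is a minimal reproducible comparison. It is only a sketch under my own assumptions: that the POSIX class `[[:upper:]]` may be limited to ASCII when `perl = TRUE`, and that the PCRE build behind R supports the Unicode property `\p{Lu}` (uppercase letter). The variable names `parts_posix` and `parts_unicode` are just illustrative, and I have not checked the behaviour across locales or platforms.

```r
# Same input as above; enc2utf8() only makes the UTF-8 encoding explicit.
lang <- enc2utf8("DeutschEsperantoItalianoNederlandsNedersaksiesNorskРусский")

# Original pattern: split before every uppercase letter (POSIX class),
# except at the very start of the string.
parts_posix <- strsplit(lang, "(?!^)(?=[[:upper:]])", perl = TRUE)[[1]]

# Variant (assumption): use the Unicode "uppercase letter" property instead,
# so that non-ASCII capitals such as Cyrillic ones can also be matched.
parts_unicode <- strsplit(lang, "(?!^)(?=\\p{Lu})", perl = TRUE)[[1]]

print(parts_posix)
print(parts_unicode)
```

Both patterns are zero-width, so no characters should be consumed; the two results should differ only in where the splits fall.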