I have a problem converting Unicode escapes in R. I am following this approach, but stri_unescape_unicode from the stringi package fails to return the correct value in some cases. Here is an example where the correct value should be the word Tomáš:
library(stringi)
test <- "Tom<U+00E1><U+009A>"
test <- gsub("<U\\+(....)>", "\\\\u\\1", test)
stri_unescape_unicode(test)
[1] "Tomá\u009a"
However, if š is represented by U+0161 rather than U+009A, everything works as expected:
test2 <- "Tom<U+00E1><U+0161>"
test2 <- gsub("<U\\+(....)>", "\\\\u\\1", test2)
stri_unescape_unicode(test2)
[1] "Tomáš"
Now, my problem is that I have a large character vector with numerous elements like test, and stri_unescape_unicode fails on some characters, such as <U+009A> here. My questions are:
- Is there a way to convert <U+009A> with stri_unescape_unicode or any other method?
- Alternatively, is there a way to automatically replace the code points on which stri_unescape_unicode fails? That is, in my example, "Tom<U+00E1><U+009A>" should become "Tom<U+00E1><U+0161>".
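One direction I can think of, sketched here under an assumption I have not confirmed for my data: U+0080 to U+009F are C1 control characters, and in Windows-1252 the byte 0x9A is š, so the markers may come from Windows-1252 bytes that were read as Latin-1 code points. The helper name unescape_markers below is hypothetical and not part of stringi; it only combines gsub, stri_unescape_unicode, base iconv, and stri_replace_all_fixed.

```r
library(stringi)

# Hypothetical helper (not part of stringi): turns "<U+XXXX>" markers into
# characters, then rereads U+0080..U+009F as Windows-1252 bytes.
# Assumption: those code points really originate from Windows-1252 text.
unescape_markers <- function(x) {
  # Rewrite <U+XXXX> as \uXXXX so stri_unescape_unicode can decode it
  x <- gsub("<U\\+([0-9A-Fa-f]{4})>", "\\\\u\\1", x)
  x <- stri_unescape_unicode(x)
  for (cp in 0x80:0x9F) {
    # Decode the single byte as Windows-1252; iconv returns NA when the
    # byte cannot be converted (e.g. it is unassigned in that code page)
    fixed <- iconv(rawToChar(as.raw(cp)), from = "CP1252", to = "UTF-8")
    if (!is.na(fixed)) {
      # Swap the C1 control character for the decoded character
      x <- stri_replace_all_fixed(x, intToUtf8(cp), fixed)
    }
  }
  x
}

unescape_markers(c("Tom<U+00E1><U+009A>", "Tom<U+00E1><U+0161>"))
```

If the assumption holds, both elements should come out as Tomáš. If the data actually came from a different code page, the from argument would need to change, and the mapping would be wrong for text that legitimately contains C1 control characters.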