I have a problem converting Unicode characters in R. I am following this approach, but stri_unescape_unicode from the stringi library fails to return the correct value in some cases. Here is an example where the correct result should be the word Tomáš:

library(stringi)
test <- "Tom<U+00E1><U+009A>"
# Rewrite the <U+xxxx> markers as \uxxxx escapes
test <- gsub("<U\\+(....)>", "\\\\u\\1", test)
stri_unescape_unicode(test)
[1] "Tomá\u009a"

However, if š is represented by U+0161 rather than U+009A, everything works as expected:

test2 <- "Tom<U+00E1><U+0161>"
test2 <- gsub("<U\\+(....)>", "\\\\u\\1", test2)
stri_unescape_unicode(test2)
[1] "Tomáš"

Now, my problem is that I have a large character vector with numerous elements like test, and stri_unescape_unicode fails on some characters, like <U+009A> here. My questions are:

  • Is there a way to convert <U+009A> with stri_unescape_unicode or any other method?
  • Alternatively, is there a way to automatically replace the Unicode markers in cases where stri_unescape_unicode fails? That is, in my example, "Tom<U+00E1><U+009A>" should become "Tom<U+00E1><U+0161>"?
pieca

1 Answer


It appears that stri_unescape_unicode() has not failed. The character has been converted, but it is a control character ("single character introducer" U+009A) and is printed using its code. Garbage in, garbage out.
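A quick way to confirm this in a plain R session (a minimal check; \p{Cc} is the Unicode "control" character category):

# U+009A belongs to the Unicode control category (Cc), so most consoles
# have no glyph for it and fall back to printing an escape
grepl("\\p{Cc}", "\u009a", perl = TRUE)
#> [1] TRUE
utf8ToInt("\u009a")  # the code point is really in the string: 0x9A = 154
#> [1] 154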

How R prints Unicode strings depends on the type of console and the locale in use. The following example was run via the reprex package using code page 1252 on Windows. Even though the unprintable character is displayed in the <U+> or \u style, the actual Unicode character does exist in the corresponding R string.

library(stringi)
test2 <- c("Tom<U+00E1><U+009A>", "Tom<U+00E1><U+0161>")
test2 <- gsub("<U\\+(....)>", "\\\\u\\1", test2)
unesc2 <- stri_unescape_unicode(test2)
unesc2
#> [1] "Tomá<U+009A>" "Tomáš"
# Both strings have 5 characters: the converted character really is there
nchar(unesc2)
#> [1] 5 5
# When printed, the unprintable character expands to the visible <U+009A>
cap2 <- capture.output(cat(unesc2, sep = "\n"))
cap2
#> [1] "Tomá<U+009A>" "Tomáš"
nchar(cap2)
#> [1] 12  5
# Printed width exceeds string length only where an escape was emitted
which(nchar(cap2) > nchar(unesc2))
#> [1] 1
# encodeString() escapes unprintable characters in a similar way
es2 <- encodeString(unesc2)
es2
#> [1] "Tomá\\u009a" "Tomáš"
nchar(es2)
#> [1] 10  5
which(nchar(es2) > nchar(unesc2))
#> [1] 1

I think capture.output() or encodeString(), combined with nchar(), can be used as above to detect strings containing bad characters, i.e., characters that are unprintable in the current locale. Then, if it seems that all cases of U+009A should actually be U+0161, fixing those is a simple job for gsub(), e.g., gsub("\u009a", "\u0161", unesc2), and so on.
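Continuing from the block above (stringi already loaded), here is a sketch of how the two steps could be combined; the detection via the Unicode control-character category \p{Cc} is locale-independent, and the replacement table is hypothetical, containing only the one pair known from this question:

# Flag strings containing any Unicode control character (category Cc);
# this catches C1 controls such as U+009A regardless of the locale
bad <- stri_detect_charclass(unesc2, "\\p{Cc}")
bad
#> [1]  TRUE FALSE

# Hypothetical lookup table of known substitutions; extend as more turn up
fixes <- c("\u009a" = "\u0161")
fixed <- stri_replace_all_fixed(unesc2, names(fixes), fixes,
                                vectorize_all = FALSE)
fixed
#> [1] "Tomáš" "Tomáš"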

mvkorpel
    Yes, 'manually' replacing these characters is what I ended up doing back then. Thanks for answering! – pieca Dec 20 '18 at 20:13