
My data contains many escape sequences of the form "<U+XXXX>". For example, one data point looks like this: "<U+043E><U+043A><U+0430><U+0437><U+044B>: 673".

What should I use to convert them easily and reliably into plain, readable text? Many rows in my table contain these Unicode escapes, and I am not sure how to handle them.

I looked for conversion methods online, but most of them don't work. For example, I tried this code to convert my data from UTF-8 to Latin-1, and it failed.

www <- c("<U+043C>")
www %>% iconv(from = "UTF-8", to = "latin1")
[1] <U+043C>

I also tried it without the angle brackets. It still doesn't convert.

www <- c("U+043C")
www %>% iconv(from = "UTF-8", to = "latin1")
[1] U+043C

Alternatively, I tried calling iconv() directly on a longer string.

example <- c("<U+041F><U+043E><U+043A><U+0430><U+0437><U+044B>: 58025")
iconv(example, "UTF-8", "latin1")
[1] "<U+041F><U+043E><U+043A><U+0430><U+0437><U+044B>: 58025"

Any ideas, folks?

Akbar Ato
  • Try this https://stackoverflow.com/questions/49739800/convert-unicode-to-readable-characters-in-r if that works. – Shrey Mar 05 '22 at 10:02
  • Thank you, @Shrey. I tried this code, but it still doesn't work. Maybe that is due to a different locale; I don't know. – Akbar Ato Mar 05 '22 at 10:36
  • Alright, maybe this https://stackoverflow.com/questions/38237358/how-to-write-unicode-string-to-text-file-in-r-windows could help. You can check. – Shrey Mar 05 '22 at 10:46
  • @Shrey's first suggestion worked for me. Could you add the code you tried and show the output in your question? – user2554330 Mar 05 '22 at 15:25

1 Answer


When you type "<U+043C>", it is interpreted as a literal string of 8 characters. Whether this string is treated as Latin-1 or UTF-8 doesn't matter: all 8 characters are plain ASCII, and both encodings represent ASCII characters with the same bytes, so iconv() has nothing to change.
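
As a quick check of the point above, here is a minimal base R sketch. The variable names (`escaped`, `converted`, `real`) are made up for illustration; the snippet only shows that the escaped form is 8 ASCII characters that `iconv()` leaves untouched, while the actual Cyrillic letter is a single character.

```r
# The escaped form: 8 literal ASCII characters, not one Cyrillic letter
escaped <- "<U+043C>"
nchar(escaped)
#> [1] 8

# Every code point is below 128, i.e. plain ASCII
all(utf8ToInt(escaped) < 128L)
#> [1] TRUE

# Re-encoding ASCII text from UTF-8 to Latin-1 changes nothing
converted <- iconv(escaped, from = "UTF-8", to = "latin1")
identical(converted, escaped)
#> [1] TRUE

# For contrast, the real character is a single code point
real <- "\u043c"
nchar(real)
#> [1] 1
```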

What you need to do is unescape the Unicode strings. The stringi package can do this for you, but it expects escapes of the form "\uXXXX", so you first need to convert each "<U+XXXX>" into that format. The following function should take care of it:

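f <- function(x) {
   # Turn each "<U+XXXX>" into "\uXXXX": replace the "<U+" prefix with "\u",
   # then drop the closing ">" (note: this removes every ">" in the string)
   x <- gsub(">", "", gsub("<U\\+", "\\\\u", x))
   # Convert the "\uXXXX" escapes into the actual characters
   stringi::stri_unescape_unicode(x)
}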

So you can do:

example <- c("<U+041F><U+043E><U+043A><U+0430><U+0437><U+044B>: 58025")
www <- c("<U+043C>")

f(example)
#> [1] "Показы: 58025"

f(www)
#> [1] "м"
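
If stringi is not available, or the text may contain a literal ">" that should be kept, a base R variant along the following lines might work. This is only a sketch under some assumptions: `unescape_u`, `df`, and `label` are hypothetical names, and it assumes every escape is written as "<U+" followed by 4 to 8 hexadecimal digits and ">".

```r
# Hypothetical helper: replace only well-formed <U+XXXX> tokens
unescape_u <- function(x) {
  # Assumed token shape: "<U+", 4 to 8 hex digits, ">"
  pattern <- "<U\\+([0-9A-Fa-f]{4,8})>"
  m <- gregexpr(pattern, x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(tokens) {
    # Strip the "<U+" prefix and the ">" suffix to leave the hex digits
    hex <- gsub("^<U\\+|>$", "", tokens)
    # Convert each hex code point into its character
    vapply(strtoi(hex, base = 16L), intToUtf8, character(1))
  })
  x
}

unescape_u("<U+041F><U+043E><U+043A><U+0430><U+0437><U+044B>: 58025")
#> [1] "Показы: 58025"

# Applying it to a whole column of a (made-up) data frame
df <- data.frame(
  label = c("<U+043C>", "a > b <U+043C>"),
  stringsAsFactors = FALSE
)
df$label <- unescape_u(df$label)
```

Because only complete tokens are matched, other text in the column, including a stray ">", is left as it is.
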
Allan Cameron