How can I create a UTF-8 string like "\u0531" in R, but taking the code "0531" as a variable?

I have a bad string (consisting of "UTF-8 codes in tags"), which I would like to dynamically turn into a good string (proper UTF-8 string).

badString <- "<U+0531><U+0067>"
goodString <- "Աg" # how can I generate that by a function?

turnBadStringToGoodString<- function (myString){
  newString <- gsub("<U\\+([0-9]{4})>","\\u\\1",myString)
  newString2 <- parse(text = paste0("'", newString, "'"))[[1]]
  return (
    newString2
    )
}

turnBadStringToGoodString ( badString )
# returns an expression. What to do next?

Please note that the desired outcome can be achieved by manually typing

"\u0531\u0067"

But how can that be done with a function? Thank you for ideas.

Also related: Converting a \u escaped Unicode string to ASCII

nilsole

1 Answer

I would suggest using gsubfn with a regex that captures the four digits of each tag and replaces the whole tag with the converted Unicode character:

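library(gsubfn)

badString <- "<U+0531><U+0067>"

turnBadStringToGoodString <- function(myString) {
  # For each <U+XXXX> tag, build the literal '\uXXXX' and let parse() interpret the escape.
  # Note: \\d{4} matches decimal digits only; code points containing A-F would need [0-9A-Fa-f]{4}.
  gsubfn("<U\\+(\\d{4})>",
         ~ parse(text = paste0("'\\u", x, "'"))[[1]],
         myString)
}

turnBadStringToGoodString(badString)
[1] "Աg"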

A bit of explanation:

  • <U\\+(\\d{4})> matches <, U and +, then captures four digits into Group 1, and finally matches >.
  • The value in Group 1 is passed to the callback function (written with ~, where it is available as x), which performs the conversion.
  • gsubfn applies the callback to every non-overlapping match in the input string.
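
If you would rather avoid both the extra package and parse(), the same idea can be sketched in base R with regmatches(), strtoi() and intToUtf8(). This is a different technique from the gsubfn callback above, offered only as a minimal sketch: the function name decode_unicode_tags is made up for illustration, and the pattern assumes each tag holds 4 to 8 hexadecimal digits.

```r
# Hypothetical helper (name not from the original post): decode <U+XXXX> tags in base R.
decode_unicode_tags <- function(x) {
  # Assumption: every tag holds 4 to 8 hexadecimal digits, e.g. <U+0531>
  pattern <- "<U\\+[0-9A-Fa-f]{4,8}>"
  m <- gregexpr(pattern, x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(tags) {
    # Strip the "<U+" prefix and the ">" suffix, leaving only the hex digits
    hex <- gsub("^<U\\+|>$", "", tags)
    # Hex digits -> integer code points -> one UTF-8 character per tag
    intToUtf8(strtoi(hex, base = 16L), multiple = TRUE)
  })
  x
}

goodString <- decode_unicode_tags("<U+0531><U+0067>")
```

Because the hex digits are converted numerically, this variant should also cope with code points that contain the letters A-F, which \\d{4} would skip.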
Wiktor Stribiżew
  • Looks good at first glance. Will implement that and confirm if it works as expected. :) – nilsole Sep 16 '16 at 09:30
  • Turned out my RStudio had problems displaying `badString` correctly with `View()`. I had to set `Sys.setlocale(locale = "Russian")` to get the right output with `read.csv()`. http://stackoverflow.com/a/34256414/2381339 – nilsole Sep 16 '16 at 11:15