Automatically escape unicode characters

Question

How can you display a unicode string, say:

x <- "•"

using its escaped equivalent?

y <- "\u2022"

identical(x, y)
# [1] TRUE

(I'd like to be able to do this because CRAN packages must contain only ASCII, but sometimes you want to use unicode in an error message or similar)

It is strange because I tested your `identical(x, y)` on different machines and the results are not invariable. Using the function `stri_escape_unicode` in the answer: `stri_escape_unicode(x)` on a Windows machine yields `"\\u0095"`. EDIT: well, it has something to do with the encoding. — Xin Yin, Aug 14 '14 at 13:53

Xin Yin · Accepted Answer · 2014-08-14T20:54:19.123

After digging into some documentation about iconv, I think you can accomplish this using only the base package. But you need to pay extra attention to the encoding of the string.

On a system with UTF-8 encoding:

> stri_escape_unicode("你好世界")
[1] "\\u4f60\\u597d\\u4e16\\u754c"

# use big endian
> iconv(x, "UTF-8", "UTF-16BE", toRaw=T)
[[1]]
[1] 4f 60 59 7d 4e 16 75 4c

> x <- "•"
> iconv(x, "UTF-8", "UTF-16BE", toRaw=T)    
[[1]]
[1] 20 22

But, if you are on a system with latin1 encoding, things may go wrong.

> x <- "•"
> y <- "\u2022"
> identical(x, y)
[1] FALSE
> stri_escape_unicode(x)
[1] "\\u0095" # <- oops!

# culprit
> Encoding(x)
[1] "latin1"

# and it causes problem for iconv
> iconv(x, Encoding(x), "Unicode")
Error in iconv(x, Encoding(x), "Unicode") : 
  unsupported conversion from 'latin1' to 'Unicode' in codepage 1252
> iconv(x, Encoding(x), "UTF-16BE")
Error in iconv(x, Encoding(x), "UTF-16BE") : 
  embedded nul in string: '\0•'

It is safer to cast the string into UTF-8 before converting to Unicode:

> iconv(enc2utf8(enc2native(x)), "UTF-8", "UTF-16BE", toRaw=T)
[[1]]
[1] 20 22

EDIT: This may cause some problems for strings already in UTF-8 encoding on some particular systems. Maybe it's safer to check the encoding before conversion.

> Encoding("•")
[1] "latin1"
> enc2native("•")
[1] "•"
> enc2native("\u2022")
[1] "•"
# on a Windows with default latin1 encoding
> Encoding("测试") 
[1] "UTF-8"
> enc2native("测试") 
[1] "<U+6D4B><U+8BD5>"   # <- BAD!

For some characters or lanuages, UTF-16 may not be enough. So probably you should be using UTF-32 since

The UTF-32 form of a character is a direct representation of its codepoint.

Based on above trial and error, below is probably one safer escape function we can write:

unicode_escape <- function(x, endian="big") {
  if (Encoding(x) != 'UTF-8') {
    x <- enc2utf8(enc2native(x))
  }
  to.enc <- ifelse(endian == 'big', 'UTF-32BE', 'UTF-32LE')

  bytes <- strtoi(unlist(iconv(x, "UTF-8", "UTF-32BE", toRaw=T)), base=16)
  # there may be some better way to do thibs.
  runes <- matrix(bytes, nrow=4)
  escaped <- apply(runes, 2, function(rb) {
    nonzero.bytes <- rb[rb > 0]
    ifelse(length(nonzero.bytes) > 1, 
           # convert back to hex
           paste("\\u", paste(as.hexmode(nonzero.bytes), collapse=""), sep=""),
           rawToChar(as.raw(nonzero.bytes))
           )
  })
  paste(escaped, collapse="")
}

Tests:

> unicode_escape("•••ERROR!!!•••")
[1] "\\u2022\\u2022\\u2022ERROR!!!\\u2022\\u2022\\u2022"
> unicode_escape("Hello word! 你好世界！")
[1] "Hello word! \\u4f60\\u597d\\u4e16\\u754c!"
> "\u4f60\u597d\u4e16\u754c"
[1] "你好世界"

It isn't, but `iconv` is. The other answer used `stri_escape_unicode` and I'm just trying to show that you don't have to use `stri_escape_unicode`. — Xin Yin, Aug 14 '14 at 15:46
This function fails for some inputs. For example, `"\u00e9"` ("é") is not ASCII but passes through without being encoded. `"\u0200"` ("Ȁ") becomes `"\002"` (control code: start of text). Also: `"\u0404"` ("Є") -> `"\\u44"` ("D"), `"\U0001f600"` (grinning face) -> `"\\u01f6"` ("Ƕ"), `"\u040413"` (`"\u0404"` followed by `"13"`) -> `"\\u4413"` ("䐓"), and `"\U0001f601"` (grinning face with smiling eyes) -> `"\\u01f601"` ("Ƕ01"). — mvkorpel, Sep 18 '18 at 13:23

score 7 · Answer 2 · answered Aug 14 '14 at 13:36

7

The package stringi has a method for doing this

stri_escape_unicode(y)
# [1] "\\u2022"

answered Aug 14 '14 at 13:36

konvas

14,126
2
40
46

2

Should've known that `stringi` would have a function for this! – hadley Aug 14 '14 at 16:20

score 2 · Answer 3 · answered Sep 08 '18 at 06:17

I wrote a small package called uniscape that can convert non-ASCII characters to the corresponding "\u1234" or "\U12345678" Unicode escape codes (obviously with a literal backslash). It can do so for any character or only for characters inside an R string (single or double quoted). The following example shows how u_escape converts a character. The output is then surrounded with quotes, parsed, and evaluated. The final result matches the original character.

x <- rawToChar(as.raw(c(0xe2, 0x80, 0xa2)))
Encoding(x) <- "UTF-8"
x
# [1] "•"
x_u <- uniscape::u_escape(x)
x_u
# [1] "\\u2022"
y <- eval(parse(text = paste0('"', x_u, '"')))
y
# [1] "•"
identical(x, y)
# [1] TRUE

The package (on GitHub) also provides RStudio addins for convenience. The addins operate on the active source editor document. The package has no hard dependencies except rstudioapi.

This picture shows an example document with a selected text area and the RStudio addin window with three uniscape addins. "Escape selection" addin has been selected.

This is the result after applying "Escape selection", with the encoding sequence of each non-ASCII character automatically highlighted (selected).

After undoing the previous operation, this is the result for "Escape strings in file". Each affected R string in the active file is automatically highlighted by the addin. Commented strings are ignored. "Escape selected strings" does the same but only for the selected text area.

score 1 · Answer 4 · answered May 11 '19 at 11:16

1

R automatically escapes unicode in C locale:

x <- "•"
Sys.setlocale(locale = 'C')
print(x)
# [1] "<U+2022>"

answered May 11 '19 at 11:16

Jeroen Ooms

31,998
35
134
207

I cannot reproduce this on windows – moodymudskipper Sep 22 '20 at 23:19

Automatically escape unicode characters

4 Answers4

Tests:

Linked