I entered a text string in a .csv file that includes a Unicode escape sequence: \U00B5 g/dL. I then read the file into an R data frame:

test <- read.csv("test.csv")

\U00B5 should produce the micro sign, µ. R reads it into the data frame as is (\U00B5), but when I print the string it shows as \\U00B5 g/dL.
By contrast, entering the strings manually works fine:

varname <- c("a", "b", "c")
labels <- c("A \U00B5 g/dL", "B \U00B5 g/dL", "C \U00B5 g/dL")
df <- data.frame(varname, labels)
test <- data.frame(varname, labels)
test
#  varname   labels
#  1       a A µ g/dL
#  2       b B µ g/dL
#  3       c C µ g/dL

How can I get rid of the escape character \ in this case and have R print the symbol? Or is there another way to print the symbol in R?

Thank you very much for this help!

MichaelChirico
outboundbird
  • When you say, *However when I print the string it shows as `\\U00B5 g/dL`.*, where are you printing the string? – Rich Scriven Mar 25 '15 at 20:06
  • Thanks Richard, I print it in the R console. – outboundbird Mar 25 '15 at 20:14
  • It seems to me that the problem is less about printing the unicode character correctly than it is about correctly reading literal unicode text from a file and having it interpreted as a unicode string. – Alex A. Mar 25 '15 at 20:18
  • I agree. Have you tried encoding the file with UTF-8? – Rich Scriven Mar 25 '15 at 20:21
  • If you encode the file with UTF-8 as @RichardScriven suggests, you can use `fileEncoding="UTF-8", allowEscapes=T` in your call to `read.csv()`. – Alex A. Mar 25 '15 at 20:22
  • If you literally have "\U00B5 g/dL" in a text file, that's not Unicode. That's just an ASCII slash followed by letters and numbers. It's unclear to me exactly what you have in your csv file. It would be nice if you provided a reproducible example (specifically showing the bytes of the file) – MrFlick Mar 25 '15 at 20:22
  • No. I just want something quick and simple on the special symbols. – outboundbird Mar 25 '15 at 20:23
  • @MrFlick: I suppose what I meant was correctly interpreting such ASCII sequences in the file as unicode strings when they get to R. – Alex A. Mar 25 '15 at 20:28
  • The example data frame you included works fine (at least for me). It's reading from the csv file that doesn't work as expected. – Alex A. Mar 25 '15 at 20:29
  • @AlexA. Yes. That's the problem! If I enter it manually, it works fine. But if I import from a `.csv` file, it adds `\\`. – outboundbird Mar 25 '15 at 20:31

1 Answer

Well, first understand that in R string literals the backslash ("\") is the escape character: it introduces special sequences for characters that are hard to type directly, such as control characters or characters outside standard ASCII. That's why a literal backslash must itself be escaped when you write a string in R:

a <- "\" # error
a <- "\\" # ok.

The "\U" is a special indicator for Unicode escaping. Note that there are no backslashes or U's in the string itself when you use this escape. It is just a shorthand for a specific character. Note:

a <- "\U00B5"
cat(a)
# µ
grep("U",a)
# integer(0)
nchar(a)
# [1] 1

This is very different from the string

a <- "\\U00B5"
cat(a)
# \U00B5
grep("U",a)
# [1] 1
nchar(a)
# [1] 6
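
To see why the console shows a doubled backslash, it can help to look at the code points behind each string. This is just a minimal sketch; the names `escaped` and `literal` are made up for illustration.

```r
escaped <- "\U00B5"   # parsed by R into a single character, the micro sign
literal <- "\\U00B5"  # six plain ASCII characters, like the text read from the csv file

utf8ToInt(escaped)    # one code point: 181 (hex B5)
utf8ToInt(literal)    # six code points: 92 85 48 48 66 53

print(literal)        # print() re-escapes the backslash, so two are displayed
cat(literal, "\n")    # cat() writes the actual content, with a single backslash
```

So the imported string holds only one backslash; the second one is added by print() for display and is not part of the data.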

Normally when you import a text file, non-ASCII characters are encoded in whatever encoding the file uses (UTF-8 and Latin-1 are the most common), which has special bytes to represent these characters. It's not "normal" for a text file to contain an ASCII escape sequence for Unicode characters. This is why R doesn't attempt to convert "\U00B5" to a Unicode character: it assumes that if you had wanted a Unicode character, you would have just used it directly.
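
For comparison, here is a sketch of the "normal" case, where the file contains the µ character itself, stored as UTF-8. The temporary file and its contents are made up for illustration; `encoding = "UTF-8"` asks `read.csv()` to mark the strings it reads as UTF-8.

```r
# Write a small csv file that contains the real micro sign as UTF-8 bytes.
path <- tempfile(fileext = ".csv")
lines <- enc2utf8(c("varname,labels", "a,A \u00B5 g/dL"))
writeLines(lines, path, useBytes = TRUE)

# Read it back; no unescaping is needed because the character is already there.
test <- read.csv(path, encoding = "UTF-8", stringsAsFactors = FALSE)
nchar(test$labels[1])  # 8 characters, the micro sign counts as one
```

If your file can be saved this way, that is probably the simplest route, since nothing has to be converted after import.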

The easiest way to reinterpret your ASCII escape sequences would be to use the stringi package. For example:

library(stringi)
a <- "\\U00B5"
stri_unescape_unicode(gsub("\\U","\\u",a, fixed=TRUE))

(The only catch is that we need to convert "\U" to the more common "\u" so that the function properly recognizes the escape.) Assuming the imported column is named labels, as in your example data frame, you can do this to your imported data with

test$labels <- stri_unescape_unicode(gsub("\\U","\\u",test$labels, fixed=TRUE))
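
If installing stringi is not an option, something similar can be sketched in base R. This is only a rough alternative, not what stringi does internally: `unescape_U` is a hypothetical helper name, and it assumes every escape is a backslash, then `u` or `U`, then 4 to 8 hex digits (so hex-looking letters directly after an escape would be swallowed), and that the input has no missing values.

```r
# Hypothetical helper: replace literal \uXXXX / \UXXXX text with the character itself.
unescape_U <- function(x) {
  x <- as.character(x)  # also handles factor columns from older read.csv() defaults
  # Locate each backslash + u/U + 4 to 8 hex digits.
  m <- gregexpr("\\\\[uU][0-9A-Fa-f]{4,8}", x)
  # Convert every match: drop the first two characters, parse the hex, build the character.
  regmatches(x, m) <- lapply(regmatches(x, m), function(hits) {
    vapply(hits, function(h) intToUtf8(strtoi(substring(h, 3), base = 16L)),
           character(1), USE.NAMES = FALSE)
  })
  x
}

cat(unescape_U("A \\U00B5 g/dL"), "\n")
```

With a data frame like yours, the call would presumably look like `test$labels <- unescape_U(test$labels)`.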
Stibu
MrFlick