
I am trying to open a UTF-8 encoded .csv file that contains (traditional) Chinese characters in R. For some reason, R sometimes displays the data as Chinese characters and sometimes as Unicode code points.

For instance:

data <- read.csv("mydata.csv", encoding="UTF-8")

data

will produce Unicode code points, while:

data <- read.csv("mydata.csv", encoding="UTF-8")

data[,1]

will actually display Chinese characters.

If I turn it into a matrix, it also displays Chinese characters, but if I look at the data with View(data) or fix(data), it is shown as Unicode code points again.

I've asked for advice from people who use a Mac (I'm using a PC with Windows 7); some of them got Chinese characters throughout, others didn't. I tried saving the original data as a table instead and reading it into R that way, with the same result. I tried running the script in RStudio, Revolution R, and RGui. I tried to adjust the locale (e.g. to Chinese), but either R didn't let me change it or the result was gibberish instead of Unicode code points.

My current locale is:

"LC_COLLATE=French_Switzerland.1252;LC_CTYPE=French_Switzerland.1252;LC_MONETARY=French_Switzerland.1252;LC_NUMERIC=C;LC_TIME=French_Switzerland.1252"

Any help to get R to consistently display Chinese characters would be greatly appreciated...

Levon
user1445297
  • Hm, this looks like a bug. For those interested, it is easily reproducible with this code: `x=c('中華民族');x;data.frame(x)`. Don't try pasting that code into the R Editor, just paste it right into the console or it won't work. – nograpes Aug 02 '12 at 02:29
  • See my answer at http://stackoverflow.com/questions/22876746/how-to-read-data-in-utf-8-format-in-r – Sathish Apr 10 '14 at 05:20

2 Answers

Not a bug, more a misunderstanding of the type conversion (from the character type to the factor type) that happens when constructing a data.frame.

You could start with `data <- read.csv("mydata.csv", encoding="UTF-8", stringsAsFactors=FALSE)`, which keeps your Chinese text as the character type, so printing it should show what you expect.

@nograpes: similarly, `x=c('中華民族'); x; y <- data.frame(x, stringsAsFactors=FALSE)` and everything should be OK.
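To check that the column really stays a character vector, here is a minimal sketch. The temporary file, the column names `name` and `value`, and the sample row are made up for illustration and stand in for `mydata.csv`. It only verifies how the data is stored; how the console or `View()` renders it still depends on your locale.

```r
# Write a small UTF-8 CSV to a temporary file (hypothetical stand-in for mydata.csv).
# enc2utf8() + useBytes = TRUE writes the UTF-8 bytes as-is, whatever the locale.
path <- tempfile(fileext = ".csv")
lines <- c("name,value", "\u4e2d\u83ef\u6c11\u65cf,1")
writeLines(enc2utf8(lines), path, useBytes = TRUE)

# Read it back, declaring the strings as UTF-8 and keeping them as character.
data <- read.csv(path, encoding = "UTF-8", stringsAsFactors = FALSE)

class(data$name)     # "character" rather than "factor"
Encoding(data$name)  # the declared encoding of the string
nchar(data$name)     # number of characters, not bytes
```

If `class(data$name)` reports a factor instead, the `stringsAsFactors=FALSE` argument did not take effect.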

jcb
    Actually, that doesn't work for me. Try running that code and then `print(y)`. I have made [a question about this](http://stackoverflow.com/questions/17715956/why-do-some-unicode-characters-display-in-matrices-but-not-data-frames-in-r) more directly addressing the problem. – nograpes Jul 18 '13 at 06:25
  • Interestingly, that now works for me (I switched to a different computer in the meantime, which might or might not make a difference). Thanks! – user1445297 Sep 24 '14 at 21:15

In my case, UTF-8 encoding does not work in R on my system, but the GB* encodings do. UTF-8 works on Ubuntu. First, figure out the default encoding of your OS and encode the file accordingly. Excel cannot save the file as UTF-8 properly, even though it claims to.

(1) Download 'Open Sheet' software.

(2) Open the file in it. Scroll through the encoding options until the Chinese characters display correctly in the preview window.

(3) Save it as UTF-8 (if you want UTF-8). UTF-8 is not the solution to every problem; you have to know the default encoding of your system first.
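For the "figure out the default encoding" part, a rough sketch of how to inspect it from within R. The sample string and the choice of GB18030 are only for illustration, and the conversion assumes your platform's iconv supports GB18030 (the `tryCatch` skips it otherwise).

```r
# Inspect the locale and native encoding of the current session.
Sys.getlocale("LC_CTYPE")
info <- l10n_info()
info[["UTF-8"]]  # TRUE if the session runs in a UTF-8 locale

# The same text has different bytes in UTF-8 and in a GB encoding.
x <- "\u4e2d\u6587"
utf8_bytes <- charToRaw(enc2utf8(x))

# Convert to GB18030 bytes; gb_bytes is NULL if this iconv lacks GB18030.
gb_bytes <- tryCatch(
  iconv(x, from = "UTF-8", to = "GB18030", toRaw = TRUE)[[1]],
  error = function(e) NULL
)

# Convert the GB18030 bytes back to UTF-8 and compare with the original.
if (!is.null(gb_bytes)) {
  back <- iconv(list(gb_bytes), from = "GB18030", to = "UTF-8")
  print(identical(back, x))
}
```

If the file itself turns out to be in a GB encoding, `read.csv()` also has a `fileEncoding` argument for declaring that, although the text then still has to be representable in your session's encoding.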

MLE