In at least some cases, Asian characters print correctly when they are in a vector or a matrix, but not when they are in a data.frame. Here is an example:

q<-'天'

q # Works
# [1] "天" 

matrix(q) # Works
#      [,1]
# [1,] "天"

q2<-data.frame(q,stringsAsFactors=FALSE) 
q2 # Does not work
#          q
# 1 <U+5929>

q2[1,] # Works again.
# [1] "天"

Clearly, my device is capable of displaying the character, but when the character is in a data.frame, it is printed as an escaped code point instead.

Doing some digging, I found that the print.data.frame function runs format on each column. It turns out that if you run format.default directly, the same problem occurs:

format(q)
# "<U+5929>"

Digging into format.default, I find that it is calling the internal format, written in C.
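
One way to separate a storage problem from a display problem is to inspect the string without going through format at all. The following is a minimal sketch, not part of the original investigation. It enters the character with a `\u` escape (an assumption, made so the snippet does not depend on the editor's encoding) and compares code points directly:

```r
# Enter the character via a Unicode escape so the script itself is plain ASCII.
q <- "\u5929"
q2 <- data.frame(q, stringsAsFactors = FALSE)

# utf8ToInt() returns the Unicode code point(s) of a single UTF-8 string;
# U+5929 is 22825 in decimal.
code_in_vector <- utf8ToInt(q)
code_in_frame  <- utf8ToInt(q2$q[1])

# Check whether the value stored in the data.frame is the same string.
same_value <- identical(q2$q[1], q)

print(code_in_vector)
print(code_in_frame)
print(same_value)
```

If both code points match, the data.frame holds the character intact and only the formatted output differs.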

Before I dig any further, I want to know if others can reproduce this behaviour. Is there some configuration of R that would allow me to display these characters within data.frames?

My sessionInfo(), if it helps:

R version 3.0.1 (2013-05-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_Canada.1252  LC_CTYPE=English_Canada.1252   
[3] LC_MONETARY=English_Canada.1252 LC_NUMERIC=C                   
[5] LC_TIME=English_Canada.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] tools_3.0.1
nograpes
  • Try setting `Sys.setlocale( locale="UTF-8" )`. It's a bit strange that rendering is inconsistent; however, `English_Canada.1252` isn't intended to handle Asian characters anyhow. – Kevin Ushey Jul 18 '13 at 06:32

2 Answers

I hate to answer my own question, but although the comments and answers helped, they weren't quite right. In Windows, it doesn't seem like you can set a generic 'UTF-8' locale. You can, however, set country-specific locales, which will work in this case:

Sys.setlocale("LC_CTYPE", locale="Chinese")
q2 # Works fine
#  q
#1 天

But, it does make me wonder why exactly format seems to use the locale; I wonder if there is a way to have it ignore the locale in Windows. I also wonder if there is some generic UTF-8 locale that I don't know about on Windows.
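
Because Sys.setlocale changes the setting only for the running session (a fresh R session starts from the system defaults again), one option is to scope the change and restore the previous value afterwards. Below is a minimal sketch. `with_ctype` is a hypothetical helper name, not a base R function. Locale names such as "Chinese" are Windows-specific, so on other platforms the call may only warn and leave the locale unchanged:

```r
# Hypothetical helper: evaluate an expression under a temporary LC_CTYPE.
with_ctype <- function(locale, expr) {
  old <- Sys.getlocale("LC_CTYPE")
  # Restore the previous LC_CTYPE when the function exits, even on error.
  on.exit(Sys.setlocale("LC_CTYPE", locale = old), add = TRUE)
  # Sys.setlocale() warns and returns "" if the locale name is not accepted.
  Sys.setlocale("LC_CTYPE", locale = locale)
  # expr is a lazy argument, so it is evaluated only here, under the new locale.
  invisible(force(expr))
}

q2 <- data.frame(q = "\u5929", stringsAsFactors = FALSE)

before <- Sys.getlocale("LC_CTYPE")
with_ctype("Chinese", print(q2))
after <- Sys.getlocale("LC_CTYPE")
```

After the call, `after` should match `before`, so the rest of the session keeps its original locale.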

Richie Cotton
nograpes
  • I'll leave this open for a bit before I check my own answer if someone wants to chime in with some more elaborate explanation of what is going on here. – nograpes Jul 18 '13 at 07:33
  • I just fell over this, and according to Duncan Murdoch it's a bug (though a tricky one to fix). https://stat.ethz.ch/pipermail/r-devel/2015-May/071252.html – Richie Cotton May 26 '15 at 07:08
  • @RichieCotton Ah! I knew it was a bug. Thanks for letting me know. – nograpes May 26 '15 at 14:17
  • When we are done do we need to change it back or is it automatic when you restart R? – Hack-R Oct 01 '17 at 16:39
  • `Sys.setlocale("LC_CTYPE", locale="Persian")` worked for Farsi too – AmiNadimi Nov 24 '17 at 19:15

I blogged about Unicode and R a few days ago. I suspect your R editor uses UTF-8, which gives the illusion that R on your Windows machine handles UTF-8 characters.

The short answer: when you need to process Unicode text (here, Chinese), don't use an English version of Windows. Use a Chinese version of Windows, or Linux, which uses UTF-8 by default.

Session info in my Ubuntu:

> sessionInfo()
R version 2.14.1 (2011-12-22)
Platform: i686-pc-linux-gnu (32-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=C                 LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
Yin Zhu
  • @nograpes, In my Ubuntu and RStudio, there is no such display problem. – Yin Zhu Jul 18 '13 at 06:41
  • Interesting. Can you post your `sessionInfo()` so I can compare and contrast our configurations? – nograpes Jul 18 '13 at 06:44
  • I was mistaken, I cannot reproduce the problem under Ubuntu. However, although this helps, it doesn't explain if there is a configuration of R under Windows that can resolve the issue. – nograpes Jul 18 '13 at 07:01
  • I'm on CentOS 6.3 and I also can't set locale following this post. However, my `sessionInfo()` is partially unicode, partially not. – wdkrnls Apr 28 '15 at 18:40