
I have a message (or warning or error) containing Unicode characters. (The string has UTF-8 encoding.)

x <- "\u20AC \ub124" # a euro symbol, and Hangul 'ne'
## [1] "€ 네"
Encoding(x)
## [1] "UTF-8"

Under Linux, this prints OK in a message if the locale is UTF-8 (l10n_info()$`UTF-8` returns TRUE).
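
A minimal sketch of that locale check, using only base R. The helper name `is_utf8_locale` is hypothetical and does not come from the question; it simply wraps the `l10n_info()` lookup mentioned above.

```r
# Hypothetical helper (not from the original post): TRUE when the current
# session reports a UTF-8 locale, FALSE otherwise.
is_utf8_locale <- function() {
  isTRUE(l10n_info()[["UTF-8"]])
}

x <- "\u20AC \ub124" # a euro symbol, and Hangul 'ne'

if (is_utf8_locale()) {
  # In a UTF-8 locale the message is expected to display as typed.
  message(x)
} else {
  # Elsewhere, non-ASCII characters may be escaped or garbled.
  message("Not a UTF-8 locale; the string may not display correctly.")
}
```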

I can force this with, for example:

devtools::with_locale(
  c(LC_CTYPE = "en_US.utf8"),
  message(x)  
)
## € 네

Under Windows there are no UTF-8 locales, so I can't find an equivalent way to enforce correct printing. For example, with a US locale, the Hangul character doesn't display properly.

devtools::with_locale(
  c(LC_CTYPE = "English_United States"),
  message(x)  
)
## € <U+B124>

There's a related problem with Unicode characters not displaying properly when printing data frames under Windows. The advice there was to set the locale to Chinese/Japanese/Korean. This does not work here.

devtools::with_locale(
  c(LC_CTYPE = "Korean_Korea"),
  message(x)  
)
## ¢æ ³×   # equivalent to iconv(x, "UTF-8", "EUC-KR")

How can I get UTF-8 messages, warnings and errors to display correctly under Windows?

Richie Cotton
  • Which Windows version are you referring to? I suspect that you cannot solve this problem on Windows 7, but maybe other versions finally gained proper Unicode support. (I do not hold my breath however) – mpiktas Sep 22 '15 at 10:38
  • @mpiktas I tested it under Windows 7, though AFAIK, R doesn't support UTF-8 locales for newer versions of Windows either, so I suspect the problem applies to all versions. Happy to be proved wrong. – Richie Cotton Sep 22 '15 at 10:47
  • I suspect that there is something going on with printing to stderr. `message` prints to stderr, and that is where the problem appears: compare `cat(x, file = stdout())` to `cat(x, file = stderr())`. I tried looking into the R source code and only managed to find out that printing to stdout and to stderr is done via different functions, but I lack the knowledge of R internals to find the root of the problem. – mpiktas Sep 22 '15 at 12:16
  • Also, if you look at `Encoding(capture.output(print(x)))` you will see that the encoding is not "UTF-8". So I can only surmise that somewhere along the way the encoding information gets mangled when printing to stderr. – mpiktas Sep 22 '15 at 12:50
  • @mpiktas Yes, the problem with `capture.output` is a bug I spotted the other day: https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=16539. I've submitted this stderr issue as another bug: https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=16543 – Richie Cotton Sep 22 '15 at 18:35
  • There's also a bug in RStudio that causes incorrect display. https://support.rstudio.com/hc/communities/public/questions/205382758-UTF-8-strings-do-not-print-correctly-to-stderr-under-Windows – Richie Cotton Sep 28 '15 at 10:24
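
The comparisons suggested in the comments above can be run together as a small diagnostic. This is only a sketch using base R; what it shows depends on the platform, locale and console, so no expected output is given here.

```r
x <- "\u20AC \ub124" # a euro symbol, and Hangul 'ne'

# Write the same string to both connections and compare what is displayed.
cat(x, "\n", file = stdout())
cat(x, "\n", file = stderr())

# Encoding mark of the original string.
Encoding(x)

# Encoding mark after a round trip through capture.output(); the comments
# report that it is no longer "UTF-8" on the affected Windows setup.
captured <- capture.output(print(x))
Encoding(captured)
```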

1 Answer


I noticed that the help for the function Sys.setlocale() in R says this: '"LC_MESSAGES" will be "C" on systems that do not support message translation, and is not supported on Windows.'

To me this sounds like modifying character representation for R messages/errors can't be done on any Windows version...

Thomas