
I am trying to set my default encoding to UTF-8, so far without success:

a <- "Hallo"
b <- "äöfd"
print(Encoding(a))
# [1] "unknown"
print(Encoding(b))
# [1] "latin1"

options(encoding = "UTF-8")
a <- "Hallo"
b <- "äöfd"
print(Encoding(a))
# [1] "unknown"
print(Encoding(b))
# [1] "latin1"

old_locale <- Sys.getlocale()
Sys.setlocale(category = "LC_ALL", locale = "English_United States.1252")
a <- "Hallo"
b <- "äöfd"
print(Encoding(a))
# [1] "unknown"
print(Encoding(b))
# [1] "latin1"

Sys.getlocale()
# [1] "LC_COLLATE=German_Switzerland.1252;
# LC_CTYPE=German_Switzerland.1252;
# LC_MONETARY=German_Switzerland.1252;
# LC_NUMERIC=C;LC_TIME=German_Switzerland.1252"
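A workaround that does not touch the locale at all is to declare or convert the encoding per string. A minimal sketch using only base R (`enc2utf8()` and `iconv()`); the string is first forced to latin1 so the example does not depend on the platform's native encoding:

```r
# Build a latin1-marked string explicitly:
b <- iconv("\u00e4\u00f6fd", from = "UTF-8", to = "latin1")
Encoding(b)               # "latin1"

# Convert (and mark) it as UTF-8:
b_utf8 <- enc2utf8(b)
Encoding(b_utf8)          # "UTF-8"

# iconv() gives the same result with an explicit source encoding:
b2 <- iconv(b, from = "latin1", to = "UTF-8")
Encoding(b2)              # "UTF-8"
```

This keeps the locale as-is and makes the encoding of each string explicit, which is usually what matters downstream.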

I found the links R Encoding for files and How to use Sys.setlocale(), but as you can see, the suggestions there don't work in my case, and I don't understand why.

I also tried Sys.setlocale(category = "LC_ALL", locale = "en_US.UTF-8"), but got

Warning message: In Sys.setlocale(category = "LC_ALL", locale = "en_US.UTF-8") : OS reports request to set locale to "en_US.UTF-8" cannot be honored
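Independent of `Sys.setlocale()`, `l10n_info()` reports what the running session actually uses; a quick check (the exact fields vary between Windows and Unix):

```r
# Report the session's native encoding support; on Windows the list
# also includes the active code page:
str(l10n_info())
# Typical output on a Windows-1252 system (abridged):
# $ MBCS   : logi FALSE
# $ UTF-8  : logi FALSE
# $ Latin-1: logi TRUE
```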

In cmd, the command systeminfo & pause gives

Systemgebietsschema (system locale): de-ch;Deutsch (Schweiz)  Eingabegebietsschema (input locale): de-ch;Deutsch (Schweiz)

Edit:

  • I fear that the "unknown" encoding could lead to mistakes I am not aware of, and
  • I thought it would be good to use the newer standard UTF-8 to avoid problems like the one I had.
  • Last but not least, I would like to get reproducible results - a colleague works on a Mac (with fewer encoding issues)...

Edit 2: What is others' experience with this issue? Is there a best practice?

  • I doubt that this is possible on a Windows system, but even if it is, you'd probably get into trouble quickly. Why do you want this? – Roland Sep 22 '16 at 07:36
  • @Roland See my edit. I write the code in English (scripts and packages). As soon as I start to use the console involving data from customers, I need the umlauts, and I run into trouble. – Christoph Sep 22 '16 at 07:50
  • 2
    1) The problem you linked to was fixed in an R update. 2) I don't think Windows supports UTF-8 sufficiently to make it the default. (But I could be wrong.) 3) I'm German and problems with Umlaute (which are part of the default latin1 character set) are so rare that I can hardly remember the last time (I think it was related to data.table and its join optimizations, which I could switch off easily). – Roland Sep 22 '16 at 07:51
  • Ok. I feared I could run into more trouble. But then perhaps everything is fine. Could you write your comment as an answer? What would you advise: change imported data to the local `latin1`? In that case I could modify the title of the question (e.g. "Is it a problem if I don't have UTF-8 as locale?") – Christoph Sep 22 '16 at 07:58
  • I don't feel sufficiently knowledgeable about this to post an answer. Encoding problems are one of my nightmares. I can just say that they have been very rare for me when using R. – Roland Sep 22 '16 at 08:09
  • The bottom line is that UTF-8 and Windows don't play well, as noted by Roland. I have given up and have a dual boot into a lightweight linux distribution which handles this seamlessly. – Roman Luštrik Sep 22 '16 at 09:57
  • @RomanLuštrik is it possible to pinpoint let's say the three most frequently met problems (which I may face in my constellation)? – Christoph Sep 22 '16 at 11:57

1 Answer


This is not a perfect answer, but a good workaround: as Roland pointed out, it might be dangerous to change the locale, so leave it as is. If you have a file and you run into trouble, just search for the non-UTF-8 content, as described here for RStudio. From what I have seen, most editors have such a feature.
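To locate offending input programmatically rather than in the editor, base R (>= 3.3) has `validUTF8()`. A self-contained sketch, using a temporary file that contains one stray latin1 byte:

```r
# Write two lines, the second starting with a lone 0xE4 byte
# ("ä" in latin1, invalid as UTF-8):
f <- tempfile()
writeBin(c(charToRaw("ok\n"), as.raw(0xE4), charToRaw("bad\n")), f)

lines <- readLines(f, warn = FALSE)   # bytes are read as-is
which(!validUTF8(lines))              # 2 -> line 2 is not valid UTF-8
```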

Furthermore, this answer gives more insight into what you can do in case you source() a file.
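A sketch of the source() case: declare the file's encoding when reading it, instead of changing any global option (the temporary file here stands in for a real UTF-8 script):

```r
# Write a small script as UTF-8 via an explicit connection encoding:
f <- tempfile(fileext = ".R")
con <- file(f, open = "w", encoding = "UTF-8")
writeLines('msg <- "\u00e4\u00f6fd"', con)
close(con)

# Tell source() the file encoding; without this, a latin1 locale
# would misread the non-ASCII bytes:
source(f, encoding = "UTF-8")
nchar(msg)   # 4 characters, read correctly
```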

For a way to deal with locales when collation plays a crucial part, see here.
