
I am trying to set my default encoding to UTF-8, so far without success:

a <- "Hallo"
b <- "äöfd"
print(Encoding(a))
# [1] "unknown"
print(Encoding(b))
# [1] "latin1"

options(encoding = "UTF-8")
a <- "Hallo"
b <- "äöfd"
print(Encoding(a))
# [1] "unknown"
print(Encoding(b))
# [1] "latin1"

old_locale <- Sys.getlocale()
Sys.setlocale(category = "LC_ALL", locale = "English_United States.1252")
a <- "Hallo"
b <- "äöfd"
print(Encoding(a))
# [1] "unknown"
print(Encoding(b))
# [1] "latin1"

Sys.getlocale()
# [1] "LC_COLLATE=German_Switzerland.1252;
# LC_CTYPE=German_Switzerland.1252;
# LC_MONETARY=German_Switzerland.1252;
# LC_NUMERIC=C;LC_TIME=German_Switzerland.1252"
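A workaround that does not touch the locale at all is to declare or convert the encoding per string. A minimal sketch using only base R (`enc2utf8()` and `iconv()`); the string is first forced to latin1 so the example does not depend on the platform's native encoding:

```r
# Build a latin1-marked string explicitly:
b <- iconv("\u00e4\u00f6fd", from = "UTF-8", to = "latin1")
Encoding(b)               # "latin1"

# Convert (and mark) it as UTF-8:
b_utf8 <- enc2utf8(b)
Encoding(b_utf8)          # "UTF-8"

# iconv() gives the same result with an explicit source encoding:
b2 <- iconv(b, from = "latin1", to = "UTF-8")
Encoding(b2)              # "UTF-8"
```

This keeps the locale as-is and makes the encoding of each string explicit, which is usually what matters downstream.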

I found the links R Encoding for files and How to use Sys.setlocale(), but as you can see, the suggestions there don't work in my case, and I don't understand why.

I also tried Sys.setlocale(category = "LC_ALL", locale = "en_US.UTF-8"), but got

Warning message: In Sys.setlocale(category = "LC_ALL", locale = "en_US.UTF-8") : OS reports request to set locale to "en_US.UTF-8" cannot be honored
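Independent of `Sys.setlocale()`, `l10n_info()` reports what the running session actually uses; a quick check (the exact fields vary between Windows and Unix):

```r
# Report the session's native encoding support; on Windows the list
# also includes the active code page:
str(l10n_info())
# Typical output on a Windows-1252 system (abridged):
# $ MBCS   : logi FALSE
# $ UTF-8  : logi FALSE
# $ Latin-1: logi TRUE
```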

In cmd, the command systeminfo & pause gives

Systemgebietsschema (system locale): de-ch;Deutsch (Schweiz)  Eingabegebietsschema (input locale): de-ch;Deutsch (Schweiz)

Edit:

  • I fear that the "unknown" encoding could lead to mistakes I am not aware of, and
  • I thought it would be good to use the newer standard UTF-8 to avoid problems like the one I had.
  • Last but not least, I would like to get reproducible results - a colleague works on a Mac (with fewer encoding issues)...

Edit 2: What is others' experience with this issue? Is there a best practice?

  • I doubt that this is possible on a Windows system, but even if it is, you'd probably get into trouble quickly. Why do you want this? – Roland Sep 22 '16 at 07:36
  • @Roland See my edit. I write the code in English (scripts and packages). As soon as I start to use the console involving data from customers, I need the umlauts, and I run into trouble. – Christoph Sep 22 '16 at 07:50
  • 2
    1) The problem you linked to was fixed in an R update. 2) I don't think Windows supports UTF-8 sufficiently to make it the default. (But I could be wrong.) 3) I'm German and problems with Umlaute (which are part of the default latin1 character set) are so rare that I can hardly remember the last time (I think it was related to data.table and its join optimizations, which I could switch off easily). – Roland Sep 22 '16 at 07:51
  • Ok. I feared I could run into more trouble. But then perhaps everything is fine. Could you write your comment as an answer? What would you advise: change imported data to the local `latin1`? In that case I could modify the title of the question (e.g. "Is it a problem if I don't have UTF-8 as locale?") – Christoph Sep 22 '16 at 07:58
  • I don't feel sufficiently knowledgeable about this to post an answer. Encoding problems are one of my nightmares. I can just say that they have been very rare for me when using R. – Roland Sep 22 '16 at 08:09
  • The bottom line is that UTF-8 and Windows don't play well, as noted by Roland. I have given up and have a dual boot into a lightweight linux distribution which handles this seamlessly. – Roman Luštrik Sep 22 '16 at 09:57
  • @RomanLuštrik is it possible to pinpoint let's say the three most frequently met problems (which I may face in my constellation)? – Christoph Sep 22 '16 at 11:57

1 Answer


This is not a perfect answer, but a good workaround: as Roland pointed out, it might be dangerous to change the locale, so leave it as is. If you have a file and you run into trouble, just search for the non-UTF-8 content, as described here for RStudio. From what I have seen, most editors have such a feature.
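To locate offending input programmatically rather than in the editor, base R (>= 3.3) has `validUTF8()`. A self-contained sketch, using a temporary file that contains one stray latin1 byte:

```r
# Write two lines, the second starting with a lone 0xE4 byte
# ("ä" in latin1, invalid as UTF-8):
f <- tempfile()
writeBin(c(charToRaw("ok\n"), as.raw(0xE4), charToRaw("bad\n")), f)

lines <- readLines(f, warn = FALSE)   # bytes are read as-is
which(!validUTF8(lines))              # 2 -> line 2 is not valid UTF-8
```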

Furthermore, this answer gives more insight into what you can do in case you source() a file.
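A sketch of the source() case: declare the file's encoding when reading it, instead of changing any global option (the temporary file here stands in for a real UTF-8 script):

```r
# Write a small script as UTF-8 via an explicit connection encoding:
f <- tempfile(fileext = ".R")
con <- file(f, open = "w", encoding = "UTF-8")
writeLines('msg <- "\u00e4\u00f6fd"', con)
close(con)

# Tell source() the file encoding; without this, a latin1 locale
# would misread the non-ASCII bytes:
source(f, encoding = "UTF-8")
nchar(msg)   # 4 characters, read correctly
```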

For a way to deal with locales when collation plays a crucial part, see here.
