Can't write some unicode characters to file

Question

Let's consider the following code:

> cat('\u2077\u2078\u2079 \u2087\u2088\u2089')
⁷⁸⁹ ₇₈₉
> out <- file("out.txt", "w", encoding = 'utf-8')
> cat('\u2077\u2078\u2079 \u2087\u2088\u2089', file=out)
> close(out)

The content of out.txt is:

78<U+2079> 789

The sub/superscript form is lost (the characters come out as plain digits), and for the superscript 9 the code point is printed instead of the character.

What's happening here? How can I have the correct form of the characters in the file as they are printed in the RStudio console?

Versions: RStudio 1.1.436 / R 3.5.2 / Windows 10

I'm pretty sure this question is a dup, but I can't find it. The issue is that R switches to the local encoding before converting to the requested one, and those characters don't exist there, so they get mangled. One solution is to make sure the string is UTF-8 (it should be, since you used `\u`) but declare it as "unknown", so it won't be translated on output. Then you don't need to declare the encoding of the output file. – user2554330, Dec 21 '18 at 15:40
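
A minimal sketch of what the comment above describes, assuming the string really is stored as UTF-8 (it should be, since it was built from `\u` escapes). The `tempfile()` path is only a stand-in for `out.txt`, and this has not been checked on the asker's exact R 3.5.2 / Windows 10 setup:

```r
# The string from the question; the \u escapes give UTF-8 bytes internally.
x <- "\u2077\u2078\u2079 \u2087\u2088\u2089"

# Drop the encoding mark so R treats the bytes as already native
# and does not translate them on output.
Encoding(x) <- "unknown"

# Hypothetical output path; no encoding= argument is passed on purpose.
path <- tempfile(fileext = ".txt")
cat(x, file = path)

# Inspect the raw bytes that ended up in the file.
bytes <- readBin(path, what = "raw", n = file.size(path))
print(bytes)
```

If the bytes are the UTF-8 sequences for the six characters (`e2 81 b7` and so on), the file is fine and any remaining garbling comes from the program used to view it.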

score 1 · Accepted Answer · answered Dec 21 '18 at 16:30

Aargh, Windows and UTF-8!

I've been puzzling over this as well, and this works for me:

options(encoding='native.enc')
out <- file('out.txt', open='w', encoding = 'UTF-8')
writeLines('\u2077\u2078\u2079 \u2087\u2088\u2089', 'out.txt', useBytes = TRUE)
close(out)
readback <- readLines('out.txt', encoding='UTF-8')

My setup is a bit older (the setup I use most is OS X): RStudio 0.99.903 / R 3.3.1 / Windows 7

The strangest thing I've encountered is that it stops working if you set `options(encoding='UTF-8')`.

And finally, I noticed that all mentions of UTF-8 are in uppercase, whereas you used lowercase. I'm not sure if that makes a difference.

Emil Bode

1,784
8
16

The example code opens a connection to `"out.txt"` but the call to `writeLines()` does not use the connection, which does not make sense. If the connection was actually used, I would suggest using `encoding = ""` or `encoding = "native.enc"` as an argument to `file()` to bypass character encoding translation. – mvkorpel Dec 31 '18 at 10:08
I know it does not make sense, and this is not the way it SHOULD be written. I just know that every "logical" solution I tried didn't work, and this does. – Emil Bode Jan 02 '19 at 13:49
Looks like the winning combo is `options(encoding='native.enc')` and `useBytes = TRUE`. – gregseth Jan 10 '19 at 09:09
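
For reference, a sketch that combines the suggestions in the comments above: the connection is actually passed to `writeLines()`, it is opened with `encoding = "native.enc"` so no translation is attempted, and `useBytes = TRUE` writes the UTF-8 bytes as they are. The `tempfile()` path is a placeholder for `out.txt`, and this is an untested-on-Windows illustration, not the code from the answer:

```r
x <- "\u2077\u2078\u2079 \u2087\u2088\u2089"

# Hypothetical output path standing in for out.txt.
path <- tempfile(fileext = ".txt")

# encoding = "native.enc" asks the connection not to re-encode anything.
con <- file(path, open = "w", encoding = "native.enc")
# useBytes = TRUE writes the string's bytes as stored (UTF-8 here).
writeLines(x, con, useBytes = TRUE)
close(con)

# Read the line back, declaring it as UTF-8, and compare with the original.
readback <- readLines(path, encoding = "UTF-8")
print(identical(readback, x))
```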
