(Related to "How can I make R maintain utf8 encodings?" but not quite the same issue; also related to "Force character vector encoding from "unknown" to "UTF-8" in R" but not exactly the same, as in my case the conversion seems to happen upon creating a tibble, not during file read/write.)
Consider the following tibble:
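library(tibble)
# aa holds literal symbols typed in the source; uu holds the same symbols as Unicode escapes
foo <- tibble(aa=c("ℬ", "𝒞", "𝒟", "ℱ", "γ","δ","ε","φ"),
              uu=c("\U0000212c","\U0001d49e","\U0001d49f","\U00002131","\U000003b3","\U000003b4","\U000003b5","\U000003c6"))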
As far as I can tell, column uu contains the Unicode escapes (code points) for the symbols in aa.
In the RStudio console:
> foo
# A tibble: 8 x 2
aa uu
<chr> <chr>
1 "B" "B"
2 "\U0001d49e" "\U0001d49e"
3 "\U0001d49f" "\U0001d49f"
4 "F" "F"
5 "<U+03B3>" "<U+03B3>"
6 "d" "d"
7 "e" "e"
8 "f" "f"
The two columns appear to be identical (as we will see, they aren't). Some characters are displayed as their Unicode escape, and some have been converted to a near equivalent: for instance, the gamma was recognized as a Unicode character and stayed a gamma (shown as <U+03B3>), but the delta was converted to a d.
> sapply(foo,Encoding)
aa uu
[1,] "unknown" "UTF-8"
[2,] "UTF-8" "UTF-8"
[3,] "UTF-8" "UTF-8"
[4,] "unknown" "UTF-8"
[5,] "UTF-8" "UTF-8"
[6,] "unknown" "UTF-8"
[7,] "unknown" "UTF-8"
[8,] "unknown" "UTF-8"
OK, so we actually have two issues. The first one is that some UTF-8 characters in column uu are not rendered by the print() method, for instance the delta; but R still knows that this is a UTF-8 delta and not really a d. Not a big deal.
The second one is that some characters have been transformed into their nearest ASCII equivalent. The delta in column aa, for instance, is converted to a real d. But the gamma is not!
We can check this by saving into a CSV file (write_csv() comes from the readr package):
readr::write_csv(foo, "foo.csv")
and now, in a text editor:
or in Excel (yes, I did import using the data source wizard and explicitly told it the file was UTF-8):
The text editor (Notepad++) lacks some of the glyphs, but it seems clear that column uu contains UTF-8 glyphs everywhere, whereas in aa some have been converted to ASCII near-equivalents. Excel has all the glyphs and renders uu correctly; in aa we still see that some characters (delta, epsilon, phi and some of the script capitals) were converted to ASCII near-equivalents.
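The same distinction can be checked without leaving R. Below is a minimal sketch using only base R; `aa_cell` and `uu_cell` are hypothetical stand-ins for one cell of each column (a plain ASCII d, which is what aa seems to hold, and an escaped delta, which is what uu holds), not values extracted from the actual tibble.

```r
# Hypothetical stand-ins for row 6 of each column:
# a plain ASCII "d" (what aa seems to contain) and an escaped Greek delta (uu).
aa_cell <- "d"
uu_cell <- "\u03b4"

# Code points: 100 (U+0064) for the ASCII letter, 948 (U+03B4) for the delta.
utf8ToInt(aa_cell)
utf8ToInt(uu_cell)

# Raw bytes: a single byte (64) for "d", two bytes (ce b4) for the UTF-8 delta.
charToRaw(aa_cell)
charToRaw(uu_cell)

# The two strings may print alike on some consoles, but they are not the same string.
identical(aa_cell, uu_cell)
```

Applying `utf8ToInt()` to each element of `foo$aa` and `foo$uu` should show in the same way which cells were really replaced by ASCII letters and which are only displayed differently.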
So, the question is double:
- Why are some Unicode characters converted to ASCII (e.g. delta, <U+03B4>) but not others (e.g. gamma, <U+03B3>)? How can I know which ones are safe? Incidentally, why are some rendered as <U+03B3> and some as "\U0001d49f"?
- How can I prevent this from happening? OK, the answer is in my question: by explicitly giving the full Unicode escape...
This is all with a French locale, R 4.1 and RStudio 1.4, but I don't think it makes a huge difference here.
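For what it's worth, here is a minimal sketch of the workaround mentioned in the second bullet, written so that the source contains no non-ASCII literal at all. It assumes that building the strings from numeric code points with base R's `intToUtf8()` is an acceptable substitute for typing the escapes by hand; the names `code_points` and `symbols` are only illustrative.

```r
# Code points taken from the uu column of the example above.
code_points <- c(0x212C, 0x1D49E, 0x1D49F, 0x2131,
                 0x03B3, 0x03B4, 0x03B5, 0x03C6)

# multiple = TRUE returns one string per code point
# instead of a single concatenated string.
symbols <- intToUtf8(code_points, multiple = TRUE)

# Inspect the declared encoding of each element.
Encoding(symbols)

# Round trip back to code points, formatted as U+XXXX for readability.
sprintf("U+%04X", vapply(symbols, utf8ToInt, integer(1), USE.NAMES = FALSE))
```

The resulting `symbols` vector could then be passed to `tibble()` in place of the literal characters in aa.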