
(Related to "How can I make R maintain utf8 encodings?", but not quite the same issue; also related to "Force character vector encoding from "unknown" to "UTF-8" in R", but not exactly the same: in my case the conversion seems to happen when the tibble is created, not during file read/write.)

Consider the following tibble:

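library(tibble)  # tibble()
library(readr)   # write_csv(), used further down

# aa holds the literal symbols, uu the same symbols written as Unicode escapes
foo <- tibble(aa=c("ℬ", "𝒞", "𝒟", "ℱ", "γ","δ","ε","φ"),
   uu=c("\U0000212c","\U0001d49e","\U0001d49f","\U00002131","\U000003b3","\U000003b4","\U000003b5","\U000003c6"))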

As far as I can tell, column uu contains the Unicode escapes (code points) for the symbols in aa.
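A minimal sketch of how that correspondence can be checked in base R, assuming the strings are built from escapes so that the result does not depend on the console or source-file encoding. The variable names are only illustrative. Note that the escapes name Unicode code points; UTF-8 is the byte encoding of those code points, which charToRaw() shows.

```r
# Build the strings from escapes rather than from typed symbols.
uu <- c("\u212c", "\U0001d49e", "\u03b3", "\u03b4")

# utf8ToInt() gives the Unicode code point of a single-character string.
code_points <- vapply(uu, utf8ToInt, integer(1), USE.NAMES = FALSE)

# Format the code points in the usual U+XXXX notation.
labels <- sprintf("U+%04X", code_points)
print(labels)

# The UTF-8 bytes are a different thing: gamma (U+03B3) is stored as ce b3.
gamma_bytes <- charToRaw("\u03b3")
print(gamma_bytes)
```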

In the RStudio console:

> foo
# A tibble: 8 x 2
  aa           uu          
  <chr>        <chr>       
1 "B"          "B"         
2 "\U0001d49e" "\U0001d49e"
3 "\U0001d49f" "\U0001d49f"
4 "F"          "F"         
5 "<U+03B3>"          "<U+03B3>"         
6 "d"          "d"         
7 "e"          "e"         
8 "f"          "f" 

The two columns appear to be identical (as we will see, they are not). Some characters are shown as their Unicode escape, and some have been converted to a near equivalent: for instance, the gamma was recognized as a Unicode character and stayed a gamma, but the delta was converted to a d.

> sapply(foo,Encoding)
     aa        uu     
[1,] "unknown" "UTF-8"
[2,] "UTF-8"   "UTF-8"
[3,] "UTF-8"   "UTF-8"
[4,] "unknown" "UTF-8"
[5,] "UTF-8"   "UTF-8"
[6,] "unknown" "UTF-8"
[7,] "unknown" "UTF-8"
[8,] "unknown" "UTF-8"

OK. So we actually have two issues. The first is that some Unicode characters in column uu are not rendered by the print() method, for instance the delta; but R still knows that this is a Unicode delta and not really a d. Not a big deal.

The second is that some characters have been transformed into their nearest ASCII equivalent. The delta in column aa, for instance, is converted to a real d. But the gamma is not!
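One way to tell a string that only prints as ASCII from one that really was replaced by ASCII is to look at its bytes instead of the printed form. The sketch below uses a hypothetical helper, is_ascii(), which is not part of any package; it only reports whether every byte is below 0x80.

```r
# Hypothetical helper: TRUE when a string consists of ASCII bytes only.
is_ascii <- function(x) {
  vapply(
    x,
    function(s) all(as.integer(charToRaw(s)) < 128L),
    logical(1),
    USE.NAMES = FALSE
  )
}

delta <- "\u03b4"   # a real Greek delta, written as an escape

# A real delta is two UTF-8 bytes (ce b4); a plain "d" is one ASCII byte.
delta_bytes <- charToRaw(delta)
ascii_flags <- is_ascii(c(delta, "d"))

print(delta_bytes)
print(ascii_flags)
```

Applied to the tibble, something like is_ascii(foo$aa) against is_ascii(foo$uu) should show which elements of aa were replaced, whatever print() displays.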

We can check this by saving into a csv file:

write_csv(foo,"foo.csv")

and now, in a text editor:

[screenshot: foo.csv opened in a text editor]

or in Excel (yes, I did import using the data source wizard and explicitly told it the file was UTF-8):

[screenshot: foo.csv imported into Excel]

The text editor (Notepad++) lacks some of the glyphs, but it seems clear that column uu contains Unicode glyphs everywhere, whereas in aa some have been converted to ASCII near-equivalents. Excel has all the glyphs and renders uu correctly; in aa we still see that some characters (delta, epsilon, phi and the weirdest script capitals) were converted to ASCII near-equivalents.
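Since editors and spreadsheets can lack glyphs or guess the encoding, a check that does not depend on any font is to read the file back as raw bytes. This is only a sketch in base R, using a temporary file rather than foo.csv and the commonly used enc2utf8() plus useBytes = TRUE idiom for writing; it is not what write_csv() does internally.

```r
x <- c("\u03b3", "\u03b4")   # gamma and delta, written as escapes

tmp <- tempfile(fileext = ".txt")

# Open in binary mode and write the UTF-8 bytes as they are,
# so no re-encoding to the native locale happens on the way out.
con <- file(tmp, open = "wb")
writeLines(enc2utf8(x), con, useBytes = TRUE)
close(con)

# Read the file back as raw bytes: each Greek letter should be two bytes
# (ce b3 and ce b4), each followed by a newline (0a).
bytes <- readBin(tmp, what = "raw", n = file.size(tmp))
print(bytes)

unlink(tmp)
```

A delta that had been replaced by "d" would show up as the single byte 64 instead of ce b4.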

So the question is twofold:

  1. Why are some Unicode characters converted to ASCII (e.g. delta, <U+03B4>) but not others (e.g. gamma, <U+03B3>)? How can I know which ones are safe? Incidentally, why are some rendered as <U+03B3> and others as "\U0001d49f"?
  2. How can I prevent this from happening? OK, the answer is in my question: by explicitly giving the full Unicode escape...
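As a variation on the workaround in point 2, the strings can also be built from numeric code points with base R's intToUtf8(), which avoids typing either the symbol or the escape. This is a sketch under the assumption that the code points are known in advance; it does not explain why the literal symbols get converted.

```r
# Code points for script B, script C, gamma and delta.
code_points <- c(0x212c, 0x1d49e, 0x03b3, 0x03b4)

# multiple = TRUE returns one string per code point instead of one long string.
symbols <- intToUtf8(code_points, multiple = TRUE)

# The resulting strings should all be marked as UTF-8.
print(Encoding(symbols))

# They should be the same strings as the ones written with escapes.
same_as_escapes <- identical(symbols, c("\u212c", "\U0001d49e", "\u03b3", "\u03b4"))
print(same_as_escapes)
```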

This is all with a French locale, R 4.1 and RStudio 1.4, but I don't think it makes a huge difference here.

jfmoyen
  • The quick answer is to wait for R 4.2 (or install the development version if you can't wait) which introduces native Unicode support for Windows. R's Unicode support on Windows has longstanding issues which the next version will rectify. – Ritchie Sacramento Mar 29 '22 at 11:59
  • See https://developer.r-project.org/Blog/public/2021/12/07/upcoming-changes-in-r-4.2-on-windows/ – Ritchie Sacramento Mar 29 '22 at 12:05
  • Oooh sweet ! This will help I'm sure. Meanwhile spelling out the full UTF codes does work, so it is a workaround if not a fix. I take the first question ("why?") becomes somewhat moot, too... – jfmoyen Mar 29 '22 at 13:36
