
I have tried to change the encoding of a tibble from "unknown" to "UTF-8", but it remains "unknown" (the data for the tibble are imported from Excel). The elements are all German strings. The first is "Die Periode ist im Format YYYY, wobei YYYY auf das Jahr verweist, z.B. 2020."

legend <- c("Die Periode ist im Format YYYY, wobei YYYY auf das Jahr verweist, z.B. 2020.", "Institutioneller Sektor")   
Encoding(legend)
    [1] "unknown" "unknown" 

I tried the following to see whether I could change the encoding:

Encoding(legend) <- "UTF-8"

But the encoding remains "unknown". If I instead recode the following example, which I found on the internet,

x <- "fa\xE7ile"
Encoding(x)
Encoding(x) <- "latin1"
xx <- iconv(x, "latin1", "UTF-8")
Encoding(xx)

this works fine:

> Encoding(xx)
[1] "UTF-8"

Furthermore, I tried

library(stringi)
legend$DE <- stri_encode(legend, "", "UTF-8")  

The encoding remains "unknown".

Another thing I tried was

write.csv(legend, file = "legend.csv", fileEncoding = "UTF-8")
legend <- read.csv("legend.csv", fileEncoding = "UTF-8")
Encoding(legend)

But the encoding remains "unknown".
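One thing worth checking at this point is whether the strings contain any non-ASCII characters at all. According to the documentation of `Encoding`, ASCII strings are never marked with a declared encoding, because their bytes are the same in every supported encoding. The sketch below assumes the real data resemble the two example strings above, which are pure ASCII; the variable names `is_ascii` and `umlaut` are made up for illustration.

```r
# The two strings from the example above; both contain only ASCII characters.
legend <- c("Die Periode ist im Format YYYY, wobei YYYY auf das Jahr verweist, z.B. 2020.",
            "Institutioneller Sektor")

# iconv() returns NA for strings that cannot be represented in ASCII,
# so non-NA results indicate ASCII-only strings.
is_ascii <- !is.na(iconv(legend, from = "UTF-8", to = "ASCII"))
is_ascii

# Declaring an encoding on ASCII-only strings leaves the mark at "unknown".
Encoding(legend) <- "UTF-8"
Encoding(legend)

# A string with an umlaut, written as a Unicode escape so that the encoding
# of the script file does not matter, does carry an encoding mark.
umlaut <- "\u00fcber"
Encoding(umlaut)
```

If `is_ascii` is `TRUE` for every element, an "unknown" mark is probably expected behaviour rather than a sign of a broken import; only elements with umlauts or other non-ASCII characters would be candidates for a "UTF-8" mark.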

My session info

R version 4.0.4 (2021-02-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] devtools_2.3.2 usethis_2.0.1 

loaded via a namespace (and not attached):
 [1] rstudioapi_0.14   magrittr_2.0.3    pkgload_1.2.0     R6_2.5.1          rlang_1.1.0       fastmap_1.1.0     tools_4.0.4      
 [8] pkgbuild_1.4.0    sessioninfo_1.2.2 cli_3.6.0         withr_2.5.0       ellipsis_0.3.1    fortunes_1.5-4    remotes_2.4.2    
[15] rprojroot_2.0.3   lifecycle_1.0.3   crayon_1.5.2      brio_1.1.3        processx_3.4.5    purrr_1.0.1       callr_3.5.1      
[22] vctrs_0.6.1       fs_1.5.0          ps_1.6.0          testthat_3.1.3    memoise_2.0.1     glue_1.6.2        cachem_1.0.4     
[29] compiler_4.0.4    desc_1.4.2        prettyunits_1.1.1

Any help would be much appreciated.

Renger

arnyeinstein
  • Hello, could you please include a minimal reproducible example (a few lines of your real dataset, or something mocked up)? – Paul Stafford Allen Jun 16 '23 at 09:05
  • I added an example, but I think that this will not reproduce the problem as it looks like this is specific to my computer. If you look at the encoding of "legend" on your computer, it will probably not show "unknown". – arnyeinstein Jun 16 '23 at 09:26
  • Is there a specific reason to use R 4.0.4? R uses UTF-8 as the native encoding on Windows since 4.2. Non-UTF8 locale looks suspicious too, or is this something common for US locales? What do you get for `readr::guess_encoding("legend.csv")` – margusl Jun 16 '23 at 09:31
  • I am working for the government and we are still trying to get version 4.2 installed, but this will take another few months... `readr::guess_encoding("legend.csv")` returns three candidates (encoding, confidence): UTF-8, 1; ISO-8859-1, 0.52; ISO-8859-2, 0.29. – arnyeinstein Jun 16 '23 at 09:38
  • I changed back to the Swiss German locale, but this did not solve the problem: LC_COLLATE=German_Switzerland.1252, LC_CTYPE=German_Switzerland.1252, LC_MONETARY=German_Switzerland.1252, LC_NUMERIC=C, LC_TIME=German_Switzerland.1252 – arnyeinstein Jun 16 '23 at 09:47
  • Is the result from `readr::read_csv("legend.txt", locale = locale(encoding = "UTF-8"))` any different from what you'd get from `read.csv()`? – margusl Jun 16 '23 at 10:03
  • Related: https://stackoverflow.com/questions/76113028/how-to-convert-all-txt-files-in-a-folder-from-utf16-to-utf8-that-they-can-be-re/76148901#76148901 – SamR Jun 16 '23 at 10:15
  • If I add some German umlauts ("ü") to the text, read.csv with encoding = "UTF-8" produces the correct text. Weird characters pop up when I read it without specifying the encoding ("über"). – arnyeinstein Jun 16 '23 at 10:27
  • @SamR: This is not a solution, as I read the data from Excel and the resulting encoding is "unknown". In R I can't change the encoding of the data frame from "unknown" to "UTF-8". – arnyeinstein Jun 16 '23 at 10:32
  • @arnyeinstein if it was a solution I'd have voted to close your question as a duplicate. I thought one of the variety of approaches might be helpful. – SamR Jun 16 '23 at 10:34

0 Answers