
Let's assume I have a large number of *.rds files, some of which have UTF-8 characters in their path. For some reason R can't handle some special accents. For example, enc2utf8("Č") should print "Č", but on my end it converts to "C", which makes it impossible for R to recognize the file. Any ideas how to handle such cases / help R with the encoding?
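
For reference, a minimal sketch to check what happens to the character (the expected UTF-8 bytes are noted in the comments; everything here is just a diagnostic example):

x <- "\u010C"               # "Č", written as an escape so the script's own encoding can't interfere
Encoding(x)                 # encoding mark R has attached to the string
enc2utf8(x)                 # should still print "Č"
charToRaw(enc2utf8(x))      # expected UTF-8 bytes for "Č": c4 8c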

Session info output:

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C                    LC_TIME=German_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lubridate_1.7.9 here_0.1        forcats_0.5.0   stringr_1.4.0   dplyr_1.0.2     purrr_0.3.4    
 [7] readr_1.3.1     tidyr_1.1.2     tibble_3.0.3    ggplot2_3.3.2   tidyverse_1.3.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5       cellranger_1.1.0 pillar_1.4.6     compiler_4.0.2   dbplyr_1.4.4     tools_4.0.2     
 [7] jsonlite_1.7.2   lifecycle_1.0.0  gtable_0.3.0     pkgconfig_2.0.3  rlang_0.4.10     reprex_0.3.0    
[13] cli_2.4.0        DBI_1.1.0        rstudioapi_0.13  haven_2.3.1      withr_2.4.2      xml2_1.3.2      
[19] httr_1.4.2       fs_1.5.0         generics_0.1.0   vctrs_0.3.3      hms_0.5.3        rprojroot_1.3-2 
[25] neuralnet_1.44.2 grid_4.0.2       tidyselect_1.1.0 glue_1.4.2       R6_2.4.1         readxl_1.3.1    
[31] modelr_0.1.8     blob_1.2.1       magrittr_1.5     backports_1.1.9  scales_1.1.1     ellipsis_0.3.1  
[37] rvest_0.3.6      assertthat_0.2.1 colorspace_1.4-1 stringi_1.4.6    munsell_0.5.0    broom_0.7.0     
[43] crayon_1.3.4   

EDIT 1:

Clarification: R can't read the file path due to UTF-8 characters in the file name.

Original file path example: G:/Users/SomeUser/Documents/University/2021/Project_M/data/procyclingstats/BORA_hansgrohe/POLJAŃSKI_Paweł_sprinter_point.rds

Neither readRDS() from base R nor read_rds() from the readr package can handle the path's encoding correctly.

Both produce the following error:

Error in gzfile(file, "rb") : cannot open the connection
In addition: Warning message:
In gzfile(file, "rb") :
  cannot open compressed file '

G:/Users/SomeUser/Documents/University/2021/Project_M/data/procyclingstats/BORA_hansgrohe/POLJANSKI_Pawel_sprinter_point.rds', probable reason 'No such file or directory'

I don't load the paths from a sourced *.txt file; instead, a function of mine builds the list of files for the given directories.

This function prints the file path correctly, so the problem is not with the way I concatenate the path string:

str_c(outputDIR_pro[i],
      sub(".+/data/Strava/.+/([0-9]+?).txt", "\\1", athlethes[[i]][[j]]) %>%
        str_match('\\d+') %>%
        str_detect(names_id_vec, .) %>%
        names_id_vec[.] %>%
        str_remove('\\d+;'),
      '_sprinter_point', '.rds') # %>% readRDS
[1] " G:/Users/SomeUser/Documents/University/2021/Project_M/data/procyclingstats/BORA_hansgrohe /POLJAŃSKI_Paweł_sprinter_point.rds"
mugdi
  • did you make sure it works as expected without special characters? – Waldi Sep 07 '21 at 12:07
    Of course. If I rename the file to match the path as it appears in the `readRDS()` error message, the function works as expected. – mugdi Sep 07 '21 at 17:31
  • Which operating system? Note: I think it is wrong to assume filenames are UTF-8 (I do not remember an operating-system API that prescribes such an encoding). If you read the filename from the OS, just do not encode it again. – Giacomo Catenazzi Sep 08 '21 at 06:09
  • The OS is specified in the session output in the question :). It is Win 10 x64. – mugdi Sep 08 '21 at 08:04

3 Answers


At first I thought your locale was the problem; Windows-1252 doesn't contain "Ń". But I couldn't reproduce your error, even with filenames like "Ń.rds" in latin1 encoding and a German locale.

But the amount of whitespace in your error was more than I got for files that didn't exist... Then I spotted the leading space in your example output.

[1] " G:/Users/SomeUser/Documents/University/2021/Project_M/data/procyclingstats/BORA_hansgrohe /POLJAŃSKI_Paweł_sprinter_point.rds"

That would explain why the path prints "okay" (we don't see the whitespace) while trying to read it fails. It does leave me puzzled as to why your other files read without problems.

If that isn't the problem, then it may be the relatively recent support for UTF-8 in Windows. Historically it has used UCS-2 and UTF-16 internally. "Turning on" UTF-8 support requires a different C runtime. There is an experimental build of R that you could try out that uses that runtime. But that requires you to rebuild your libraries (readr!) with that runtime too.

Before messing up your whole R installation, I'd use the experimental build to test whether you can read a file called Ń.csv.
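
A minimal test along those lines might look like this (run from a writable directory; the file name and contents are placeholders):

fn <- "\u0143.csv"                            # "Ń.csv"
write.csv(data.frame(x = 1:3), fn, row.names = FALSE)
file.exists(fn)                               # did the name survive the round trip?
read.csv(fn)                                  # should work on the experimental UTF-8 build; may fail otherwise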

Chris Wesseling
  • I will give it a try. I hadn't noticed the whitespace so far. It makes sense that a wrong path can't lead to an existing file. But like you said, how can this lead to behaviour that destroys the path by adding whitespace to a part of it that doesn't even contain a UTF-8 character at that exact position? By the way, in the end I helped myself with a simple `C#` script that replaces all UTF-8 characters with an underscore. Surely this can be problematic by creating indistinguishable paths, but luckily, in my case it didn't. – mugdi Sep 09 '21 at 08:04

I think I have the solution to the original problem, although I do not fully understand why this solution works. I'm using R version 3.6.1 on Windows 10 (64-bit) with locale "English_United States.1252".

readRDSunicode <- function(filename) {
  # expect a character path explicitly marked as UTF-8
  stopifnot(is.character(filename), Encoding(filename) == "UTF-8")
  # open a plain binary file connection and wrap it in gzcon() ourselves,
  # instead of letting readRDS() go through gzfile()
  z <- file(filename, open = "rb")
  conn <- gzcon(z)
  on.exit(close(conn))  # release the connection when done
  readRDS(conn)
}

filename <- "POLJA\U0143SKI_Pawe\U0142_sprinter_point.rds"
setwd("C:/Users/Public/Documents/R Working Directory/test unicode in filenames")
xx <- readRDSunicode(filename)

The trick is to use gzcon instead of gzfile. I got that from the manual page UTF8filepaths, which I think might be new in R version 4. The "Windows" section of that page notes that gzfile (and a few other functions) cannot access paths that are not in the current encoding. The page goes on to say, "For functions using gzfile (including load, readRDS, read.dcf and tar), it is often possible to use a gzcon connection wrapping a file connection."
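
Following that note, the same wrapping can be sketched for load() as well (the function and file names below are placeholders):

loadUnicode <- function(filename, envir = parent.frame()) {
  # wrap a plain binary file connection in gzcon() instead of letting load() call gzfile()
  conn <- gzcon(file(filename, open = "rb"))
  on.exit(close(conn))
  load(conn, envir = envir)
}
# loadUnicode("POLJA\U0143SKI_Pawe\U0142_workspace.RData")  # placeholder file name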

Also, when specifying the filename, you need to use \U notation. It doesn't work if you put filename <- "POLJAŃSKI_Paweł_sprinter_point.rds". (I assume that would work if your locale were Polish.) To get the hex codes for the special characters, I went to Richard Tobin's UTF-8 conversion tool.
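
If you prefer to stay inside R, the codepoints can also be obtained with base functions; a small sketch for the two characters from the example path:

x <- "\u0143\u0142"                  # "Ńł", written with escapes
utf8ToInt(x)                         # integer codepoints: 323 322
sprintf("\\U%04X", utf8ToInt(x))     # "\\U0143" "\\U0142" -- the escapes to paste into the script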

There's a question from 2014 that raises a similar issue of Windows filenames with characters that are not in the native encoding. The answers there have some useful tips, such as using the function Sys.glob(paths="*") instead of dir().
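
Applied to the question's directory layout, that could look roughly like this (the glob pattern is my assumption about the folder structure):

files <- Sys.glob("G:/Users/SomeUser/Documents/University/2021/Project_M/data/procyclingstats/*/*.rds")
Encoding(files)      # check how the returned paths are marked
head(files)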

I discussed a related issue, about using special characters in R for Windows, here.

Various bug reports have been filed about the treatment of UTF-8-encoded strings in R for Windows, including 11515, 14271, 15762, 16064, 16101, and 16232.

Montgomery Clift

Not sure if I understand it correctly. Is the issue with the file paths of the Rds files containing non-ASCII characters, or is it with the contents of the Rds files containing strings with non-ASCII characters?

If the issue is with the file paths and you are manually typing these texts/paths into text files which you then load through e.g. source(), the issue is likely that the text files are not being saved with the correct encoding, in which case you can use Unicode notation directly. Example:

cat("\u100")

If the issue is with the contents of the files, and these Rds files were generated elsewhere and you are sure that they contain the correct text (e.g. cat(variable) shows what it is supposed to show), they should work correctly with base R I/O functions, but for other packages you might need to convert the strings with enc2native(). Note that not all CRAN packages that do I/O support non-ASCII characters on Windows.
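
A small sketch of that enc2native() conversion, using a placeholder string:

s <- "Pawe\u0142"            # placeholder UTF-8 string ("Paweł")
Encoding(s)                  # "UTF-8"
s_native <- enc2native(s)
Encoding(s_native)           # depends on the locale; characters the native
                             # encoding cannot represent are escaped as <U+xxxx>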

What exactly is failing?

anymous.asker