R: filename list result not recognized for actually reading the file (filename character encoding problem)

Question

I get .xlsx files from various sources, to read and analyse the data in R, working great. Files are big, 10+ MB. So far, readxl::read_xlsx was the only solution that worked. xlsx::read.xls produced only error messages: Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, : java.lang.OutOfMemoryError: GC overhead limit exceeded)

Problem: some files have non-standard letters in the filename, e.g. displayed in Windows 10/explorer as '...ü...xlsx' (the character 'ü' somewhere in the filename). When I read all filenames in the folder in R, I get '...u"...xlsx'). I check for doublettes of the filenames from different folders before I actualle read the files. However, when it comes to read the above file, I get an error message '... file does not exist', no matter if I use

the path/filename character variable directly obtained from list.files (showing '...u"...xlsx')
the string constant '...u"...xlsx'
the string constant '...ü...xlsx'

As far as I understand, the problem arises from aequivalent, yet not identical, unicode compositions. I have no influence on how these characters are originally encoded. Therefore I see no way to read the file, other than (so far manually) rename the file in Windows explorer, changing an 'ü' coded as 'u+"' to 'ü'.

Questions:

is there a workaround within R? (keep in mind the requirement to use read_xlsx, unless a yet unknown package works with huge files.
if not possible within R, what would be the best option to change filenames automatically ('u+"' to 'ü') - I need to keep the 'ü' (or ä, ö, and others) in order to connect the analysis results back to the input), preferrably without additional (non-standard) software (e.g. command shell).

EDIT:

To read the list of files, dir_ls works (as suggested), but it returns an even stranger filename: 'oÌˆ' instead of 'ö', which in turn cannot be read (found) by read_xlsx either.

some issues with `readxl` to keep in mind https://github.com/tidyverse/readxl/issues/598 — cephalopod, May 10 '20 at 00:01

score 0 · Answer 1 · answered May 09 '20 at 23:57

0

try using the fs library. My workflow looks something like this:

library(tidyverse)
library(lubridate)
library(fs)
library(readxl)


directory_to_read <- getwd()

file_names_to_read <- dir_ls(path = directory_to_read, 
       recurse = FALSE,              # set this to TRUE to read all subdirectories 
       glob = "*.xls*",             
       ignore.case = TRUE) %>%           # This is to ignore upper/lower case extensions

# Use this to weed out temp files - I constantly have this probles
    str_subset(string = ., 
               regex(pattern = "\\/~\\$", ignore_case = TRUE),  #use \\ before $ else it will not work
               negate = TRUE)        # TRUE Returns non-matching patterns


map(file_names_to_red[4], read_excel)

answered May 09 '20 at 23:57

cephalopod

1,826
22
31

Thanks. To read the directory file list, dir_ls (i.e. the tidyverse version) works, but returns an even stranger version of the file name: 'oÌˆ' instead of 'ö', which is unfortunately of no help for read_xlsx and read_excel (both complain they connot find such a file). Did I miss a point in the answer? – Martin May 10 '20 at 01:18
can you try `dir_ls() %>% map_lgl(file_exists)` and pot your reply back in here – cephalopod May 10 '20 at 02:59
file_exists thinks that the file exists: a) pipe notation: ... file_exists... => '...oÌˆ...xlsx" TRUE b) file_exists(file_names_to_read) => TRUE as well, but displays (echo) as '...ö...xlsx, as does file_names_to_read ('...ö...xlsx') However: read_excel/read_xlsx does not, error message: Error: Evaluation error: zip file '...o¨...xlsx' cannot be opened (i.e. after 'ö' and 'oÌˆ now the 3rd version as 'o"'), no matter if evoked directly as function, or using map or map_lgl. – Martin May 10 '20 at 14:58
Something similar happened to me in the past. The issue was the downloaded file was pretending to be an xl file. I could open the file in XL but not with R. To solve this, I wrote a macro in XL to open and re-save the file. I then called this macro from R making it a smooth workflow. As a start can you open/save file in xl and then read from R. – cephalopod May 11 '20 at 00:52
I can read the file in R, I don't have to change anything of the file content, but only after changing the file name: put cursor right next to the letter pretending to be an 'ö, 'ü' or whatever non-standard. Press backspace, no character is deleted, but 'ö' turns to 'o' (probably the right invisible character of 'o"' was deleted). Delete also 'o', and press 'ö' on keybord. After that, can raed by R. – Martin May 11 '20 at 17:16
are you on windows? In any case then use `file.rename` to update file name via script and read. You might find `basename` useful. – cephalopod May 12 '20 at 01:55
to detect special characters see https://stackoverflow.com/questions/36928870/r-check-if-string-contains-special-characters – cephalopod May 12 '20 at 02:30

R: filename list result not recognized for actually reading the file (filename character encoding problem)

1 Answers1