I get .xlsx files from various sources and read and analyse the data in R, which works well. The files are big, 10+ MB. So far, readxl::read_xlsx was the only solution that worked. xlsx::read.xlsx produced only error messages: Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, : java.lang.OutOfMemoryError: GC overhead limit exceeded
Problem: some files have non-standard letters in the filename, e.g. shown in Windows 10 Explorer as '...ü...xlsx' (the character 'ü' somewhere in the filename). When I list all filenames in the folder in R, I get '...u"...xlsx'. I check for duplicate filenames across different folders before I actually read the files. However, when I try to read the above file, I get the error message '... file does not exist', no matter whether I use
- the path/filename character variable directly obtained from list.files (showing '...u"...xlsx')
- the string constant '...u"...xlsx'
- the string constant '...ü...xlsx'
As far as I understand, the problem arises from equivalent, yet not identical, Unicode compositions. I have no influence on how these characters are originally encoded. Therefore I see no way to read the file other than renaming it (so far manually) in Windows Explorer, changing an 'ü' coded as 'u+"' to 'ü'.
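To illustrate what "equivalent, yet not identical" means here, a minimal base R sketch. It assumes the filename contains the decomposed form ('u' followed by the combining diaeresis U+0308), which is a guess based on the 'u"' display, not something verified against the actual files. The variable names are only for illustration.

```r
# Precomposed form: a single code point, U+00FC
composed <- "\u00fc"
# Decomposed form (assumed to be what the filename contains):
# 'u' followed by the combining diaeresis U+0308
decomposed <- "u\u0308"

# Code points of each string
utf8ToInt(composed)    # 252
utf8ToInt(decomposed)  # 117 776

# Both may render as the same glyph, but R treats them as different strings
identical(composed, decomposed)    # FALSE
nchar(composed, type = "chars")    # 1
nchar(decomposed, type = "chars")  # 2
```

If this assumption holds, typing 'ü' on the keyboard produces the precomposed form, so the string constant would not match the name stored on disk.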
Questions:
- is there a workaround within R? (Keep in mind the requirement to use read_xlsx, unless a package I don't know yet works with huge files.)
- if not possible within R, what would be the best option to change filenames automatically ('u+"' to 'ü')? I need to keep the 'ü' (or ä, ö, and others) in order to connect the analysis results back to the input, preferably without additional (non-standard) software (e.g. command shell).
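For the second question, the renaming I have in mind could be sketched in base R roughly as follows. This is untested against the real files. The lookup table and the function names compose_umlauts and rename_to_composed are hypothetical, and it only covers the German umlauts. It also assumes that file.rename can reach the file under the name returned by list.files, which may fail for the same reason read_xlsx does. For the general case, stringi::stri_trans_nfc would do the normalization, but that is an additional package.

```r
# Hypothetical lookup: decomposed sequence -> precomposed character
nfc_map <- c(
  "a\u0308" = "\u00e4", "o\u0308" = "\u00f6", "u\u0308" = "\u00fc",
  "A\u0308" = "\u00c4", "O\u0308" = "\u00d6", "U\u0308" = "\u00dc"
)

# Replace each decomposed umlaut in x by its precomposed equivalent
compose_umlauts <- function(x) {
  x <- enc2utf8(x)
  for (i in seq_along(nfc_map)) {
    x <- gsub(names(nfc_map)[i], nfc_map[[i]], x, fixed = TRUE)
  }
  x
}

# Rename all .xlsx files in `dir` whose names change under compose_umlauts;
# returns a data frame of old name, new name, and whether the rename succeeded
rename_to_composed <- function(dir) {
  old <- list.files(dir, pattern = "\\.xlsx$")
  new <- compose_umlauts(old)
  changed <- old != new
  ok <- file.rename(file.path(dir, old[changed]),
                    file.path(dir, new[changed]))
  data.frame(old = old[changed], new = new[changed], renamed = ok)
}
```

Calling rename_to_composed on a folder would return one row per renamed file, which could also serve as a log for mapping results back to the original names.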
EDIT:
To list the files, dir_ls works (as suggested), but it returns an even stranger filename: 'ö' instead of 'ö', which in turn cannot be found by read_xlsx either.
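The 'ö' output looks like the UTF-8 bytes of a decomposed 'ö' ('o' plus U+0308) being interpreted as Windows-1252. That is an assumption about what happens here, not something I have confirmed. A small base R sketch that reproduces the same three characters, and reverses the misreading, under that assumption:

```r
# Decomposed 'ö': 'o' followed by the combining diaeresis U+0308
decomposed <- "o\u0308"

# Its UTF-8 bytes: 6f cc 88
bytes <- charToRaw(decomposed)

# Interpret those bytes as Windows-1252 (CP1252) and convert to UTF-8
misread <- iconv(rawToChar(bytes), from = "CP1252", to = "UTF-8")
utf8ToInt(misread)  # 111 204 710, i.e. 'o', U+00CC, U+02C6

# Reverse the misreading: back to the CP1252 bytes, then declare them UTF-8
repaired <- iconv(misread, from = "UTF-8", to = "CP1252")
Encoding(repaired) <- "UTF-8"
utf8ToInt(repaired)  # 111 776
```

If this is what happens, the name on disk would still be the decomposed form, and only the string returned to R is mis-decoded.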