
I have a bunch of text files with filenames that contain non-ASCII characters. For example, reading one of them by name fails:

readLines('bbb/ović, Melika_ Omeragić, Ismir_ Bata.txt')

## Error in file(con, "r") : cannot open the connection
## In addition: Warning message:
## In file(con, "r") :
##   cannot open file 'bbb/ovi?, Melika_ Omeragi?, Ismir_ Bata.txt': Invalid argument

I tried listing the directory:

dir('bbb')
## [1] "ovic, Melika_ Omeragic, Ismir_ Bata.txt"

So I tried:

readLines(list.files('bbb', full.names = TRUE))

## Error in file(con, "r") : cannot open the connection
## In addition: Warning message:
## In file(con, "r") :
##   cannot open file 'bbb/ovic, Melika_ Omeragic, Ismir_ Bata.txt': No such file or directory

How can I programmatically read these files in? The content of the files does not matter for this question, only the special characters in the file names and reading the files in.

If need be, I'm also open to changing the file names in order to read them in.

I realize I have no MWE but can't create one for this problem. Simply generating a text file and naming it: ović, Melika_ Omeragić, Ismir_ Bata.txt and using the code I have above to read it in will illustrate the problem.

Tyler Rinker
  • can you use a `system()` command to rename the files to valid names? – SymbolixAU Feb 03 '18 at 05:15
  • What OS? It works fine for me on a Mac. – alistaire Feb 03 '18 at 06:17
  • @alistaire Windows – Tyler Rinker Feb 03 '18 at 14:15
  • Hmm...I was looking at [fs](https://github.com/r-lib/fs), whose [`path_sanitize`](http://fs.r-lib.org/reference/path_sanitize.html) may be useful, and discovered [this description of restrictions](https://kb.acronis.com/content/39790). I don't think it's an answer, but maybe it points in a useful direction at least. – alistaire Feb 03 '18 at 21:56

3 Answers


You can rename all the files from non-ASCII to simpler names using a single line of code:

file.rename(Sys.glob("*"),list.files())

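Indeed, Sys.glob is similar to list.files, but in my experience it handles non-ASCII characters better: Sys.glob returns the real names while list.files returns the ASCII-mangled ones, so the call renames each file to its mangled name. Note that this relies on both functions returning the files in the same order.

If you want to do this renaming recursively in multiple subfolders, I recommend using the fs package (functions file_move and dir_ls). For a little more info, see my answer over there: Reading accented filenames in R using list.files.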

Then readLines should work fine, but without special characters :-)
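If clobbering the names in one step feels risky, the rename can be previewed first. The following is only a sketch under assumptions: `ascii_name()` and `plan_renames()` are hypothetical helpers (not part of base R or fs), every non-ASCII character is simply replaced by `_`, and Sys.glob is assumed to return the real file names on your system, as described above.

```r
# Hypothetical helper: replace every non-ASCII character with "_".
ascii_name <- function(x) {
  gsub("[^\\x01-\\x7F]", "_", enc2utf8(x), perl = TRUE)
}

# Hypothetical helper: list the renames that would happen in `dir`
# without touching the file system.
plan_renames <- function(dir) {
  old <- Sys.glob(file.path(dir, "*"))
  new <- file.path(dirname(old), ascii_name(basename(old)))
  changed <- old != new
  data.frame(old = old[changed], new = new[changed], stringsAsFactors = FALSE)
}

# Usage (not run): inspect the plan, then rename only if it looks right.
# plan <- plan_renames("bbb")
# print(plan)
# file.rename(plan$old, plan$new)
```

Two different original names can map to the same ASCII name, so it is worth checking `plan$new` for duplicates before calling file.rename.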

Dr_Ruben
  • Clobbering your filesystem doesn't seem like a great solution – Hong Ooi Sep 12 '21 at 20:32
  • @HongOoi Yeah I can understand that, it depends on your endgoal. For my project I had to rename all the files, and in the question OP said "If need be if there's a way to changing the file names in order to read them in I'm open to that as well.", that's why I suggested this. – Dr_Ruben Sep 13 '21 at 09:10

I am able to read a file in with the name ović, Melika_ Omeragić, Ismir_ Bata.txt, using readr's read_lines_raw. The byte sequence even seems to match the text inside, which is a good thing.

# File on my desktop
path <- '~/Desktop/ović, Melika_ Omeragić, Ismir_ Bata.txt'
# Assuming the file contains the word 'foobar'
x <- charToRaw('foobar')

# Using readr
n <- readr::read_lines_raw(path)
print(n)
## [[1]]
## [1] 66 6f 6f 62 61 72

print(x)
## [1] 66 6f 6f 62 61 72

Hope this helps.
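Since read_lines_raw returns a list of raw vectors (one per line), the bytes still have to be turned back into text. A minimal sketch, assuming the file content is UTF-8 and contains no embedded nul bytes; `raw_lines_to_text()` is a hypothetical helper name, not a readr function:

```r
# Hypothetical helper: convert a list of raw vectors (one per line)
# into a character vector, marking the result with the given encoding.
raw_lines_to_text <- function(raw_lines, encoding = "UTF-8") {
  text <- vapply(raw_lines, rawToChar, character(1))
  Encoding(text) <- encoding
  text
}

# Usage (not run), with `path` as defined above:
# raw_lines_to_text(readr::read_lines_raw(path))
```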

petergensler

This is pretty tricky on Windows, but I was able to find a workaround using these posts:

equivalent of (dir/b > files.txt) in PowerShell

R: can't read unicode text files even when specifying the encoding

The idea is to write the file names to a temporary file and then read them back from there with the appropriate encoding.

My solution is as follows (I use the here package only for reproducibility):

library(here)

obtain.files <- function(folder){
  # Obtain all files in folder and write output into file
  system(paste0("cmd /K ",'cd /d "',folder,'/" &  cmd /u /c "dir /b > filestmp.txt"'))
  tmpfilepath <- paste0(folder,"/filestmp.txt")
  # Read the temporary file
  # Not sure it will work in all Windows versions
  con <- file(tmpfilepath, encoding = "UCS-2LE")
  RL <- readLines(con)
  close(con)

  # Remove file
  file.remove(tmpfilepath)
  # Keep only valid files
  RL <- RL[RL!="filestmp.txt"]
  return(RL)
}

folder <- here::here("bbb")
# There is only one file in the folder
files <- obtain.files(folder)

readLines(here::here("bbb",files))

I used the cmd command found in the first post, and its output was in UCS-2LE. This might not hold on every system. With PowerShell, filestmp.txt was in UTF-16, which is probably the more general case.
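If the connection-based read misbehaves, the listing can also be decoded by hand. This is a sketch under assumptions rather than a tested replacement: `read_utf16le_lines()` is a hypothetical helper, the listing is assumed to be little-endian UTF-16 (which covers the UCS-2LE case for these names), and a leading byte order mark, if present, is dropped.

```r
# Hypothetical helper: read a UTF-16LE text file and return its lines as UTF-8.
read_utf16le_lines <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  # Drop a little-endian byte order mark (FF FE) if the file starts with one
  if (length(bytes) >= 2 &&
      bytes[1] == as.raw(0xff) && bytes[2] == as.raw(0xfe)) {
    bytes <- bytes[-(1:2)]
  }
  # iconv accepts a list of raw vectors and returns a character vector
  text <- iconv(list(bytes), from = "UTF-16LE", to = "UTF-8")
  # Split on Windows or Unix line endings
  strsplit(text, "\r?\n")[[1]]
}

# Usage (not run), with `folder` as defined above:
# files <- read_utf16le_lines(file.path(folder, "filestmp.txt"))
```

Decoding to UTF-8 explicitly avoids going through the session's native encoding, which is where the `?` substitutions in the question come from.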

Jon Nagra