I tried to list the files in a directory with the dir function on Windows 11, but found that in some cases it fails to return all files without any error or warning. On further investigation (see the code below for details), dir fails to return all files when the directory contains a file whose name is very long (apparently more than 259 bytes). dir seems to stop searching when it reaches the file with the very long name, so the files that would be listed after it are also missing, even if their names are short. Interestingly, even when dir fails to detect the file with the very long name, file.exists still detects its existence correctly.
ADD 2023.08.30 12:44 UTC
I tried the following code in R 4.3.1 on Windows 11. The same code causes a crash in R 4.2.2.
#create test directory
dir.create("test",showWarnings = FALSE)
# step 0 ----------------------------------------------------------------------
#generate 01_test.txt, 02_test.txt, ...
for(i in c(1,2,3,11,12,13)){
file.create(sprintf("test/%02d_test.txt",i))
}
# dir finds all six files.
dir("test")
# step 1 ----------------------------------------------------------------------
# use a Unicode character
# on Windows, file names seem to be encoded in UTF-8 (?)
# \u3042 is a Japanese kana character.
longname = paste(c("test/05_test",rep("\u3042",times=10),".txt"),collapse = "")
file.create(longname)
# Again, dir finds all (now seven) files.
dir("test")
# step 2 ----------------------------------------------------------------------
# try a longer file name
# in UTF-8, \u3042 is 3 bytes, so "05_test" (7 bytes) + 83*3 bytes + ".txt" (4 bytes) = 260 bytes
morelongname = paste(c("test/05_test",rep("\u3042",times=83),".txt"),collapse = "")
file.create(morelongname)
# dir fails to find all (now eight) files.
# dir seems to stop searching at the "morelongname" file, so the files that come after it,
# i.e., 11_test.txt, 12_test.txt, 13_test.txt, are also not found.
dir("test",full.names = TRUE)
# however, file.exists can detect that the file exists.
file.exists(morelongname)
# file.remove also works well
file.remove(morelongname)
# step 3 ----------------------------------------------------------------------
# try a name that is one byte shorter
# "05_tes" (6 bytes) + 83*3 bytes + ".txt" (4 bytes) = 259 bytes
morelongname2 = paste(c("test/05_tes",rep("\u3042",times=83),".txt"),collapse = "")
file.create(morelongname2)
# works well, so 260 bytes seems to be the threshold.
dir("test",full.names = TRUE)
I know that Windows limits file names to 260 characters, but as far as I understand that limit should count characters, not bytes.
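The character count and the byte count of the names from steps 2 and 3 can be checked directly in R. This is only a small sketch: the variable names name260 and name259 are made up for illustration, and the names are built without the "test/" prefix, as in the byte arithmetic in the comments above.

```r
# Base names from steps 2 and 3, without the "test/" directory prefix.
name260 <- paste0("05_test", strrep("\u3042", 83), ".txt")
name259 <- paste0("05_tes", strrep("\u3042", 83), ".txt")

# Convert to UTF-8 explicitly so the byte count does not depend on
# the native encoding of the session.
chars260 <- nchar(enc2utf8(name260), type = "chars")
bytes260 <- nchar(enc2utf8(name260), type = "bytes")
bytes259 <- nchar(enc2utf8(name259), type = "bytes")

print(c(chars260 = chars260, bytes260 = bytes260, bytes259 = bytes259))
```

Both names are well under 260 characters, so if the limit really counted characters, neither of them should cause trouble.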
My questions are:
- Is this the expected behavior of dir, or is it a bug?
- It is difficult to tell in advance from R that a folder contains a file with an extremely long name, and it is also difficult to notice that dir has failed, because it gives no error or warning. Do you have any idea how to detect or avoid this problem? In my case, it is okay to ignore the file with the very long name, because such files are very rare.
ADD 2023.08.31 01:51 UTC
I tried two solutions.
One is to use the system function with the intern option, which returns the output of the command as a character vector.
system("dir test",intern=TRUE) |>
stringr::str_split("\\s+") |>
purrr::flatten_chr()
It works partially, but fails to read Unicode characters.
[1] "01_test.txt"
[2] "02_test.txt"
[3] "03_test.txt"
[4] "05_test\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202.txt"
[5] "05_test\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201"
[6] "11_test.txt"
[7] "12_test.txt"
[8] "13_test.txt"
\343\201\202 is the three-byte UTF-8 encoding of \u3042 written as octal escapes, so I guess this is an encoding problem, but I could not solve it.
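One possible way to turn such escaped names back into readable strings is sketched below. It assumes that every \NNN in the output is an octal escape for a single byte and that the bytes are UTF-8, which matches the \343\201\202 pattern above. decode_octal_escapes is a hypothetical helper, not part of any package, and it cannot repair a name that was cut off in the middle of a sequence, as in element [5].

```r
# Hypothetical helper: convert "\NNN" octal escapes back to raw bytes
# and mark the result as UTF-8.
decode_octal_escapes <- function(x) {
  vapply(x, function(s) {
    # Split into tokens: either one octal escape or one ordinary character.
    tokens <- regmatches(s, gregexpr("\\\\[0-7]{3}|.", s))[[1]]
    bytes <- unlist(lapply(tokens, function(tok) {
      if (grepl("^\\\\[0-7]{3}$", tok)) {
        as.raw(strtoi(substr(tok, 2, 4), base = 8L))
      } else {
        charToRaw(tok)
      }
    }))
    out <- rawToChar(bytes)
    Encoding(out) <- "UTF-8"
    out
  }, character(1), USE.NAMES = FALSE)
}

# A shortened version of the escaped name from the output above.
escaped <- "05_test\\343\\201\\202\\343\\201\\202.txt"
decoded <- decode_octal_escapes(escaped)
```

Note that splitting the command output on whitespace, as in the code above, would also break file names that contain spaces, so this only addresses the escaping.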
The other was suggested by @Jean-Claude Arbaut in the comments: calling Python from R through the reticulate package.
os = reticulate::import("os")
os$listdir("test")
Although importing the os module adds a small overhead, it works well! At least in my case, this method is the likely solution.
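Since the listing from os$listdir is complete, it could also be used to detect the problematic files in advance. The sketch below is only an illustration: flag_long_names is a hypothetical helper, and the default limit of 260 bytes is simply the threshold observed in steps 2 and 3, not a documented constant.

```r
# Hypothetical helper: return the names whose UTF-8 byte length
# reaches `limit` (260 bytes is the threshold observed in steps 2 and 3).
flag_long_names <- function(file_names, limit = 260L) {
  n_bytes <- nchar(enc2utf8(file_names), type = "bytes")
  file_names[n_bytes >= limit]
}

# Example with names built as in steps 2 and 3.
candidates <- c(
  "01_test.txt",
  paste0("05_tes", strrep("\u3042", 83), ".txt"),   # 259 bytes
  paste0("05_test", strrep("\u3042", 83), ".txt")   # 260 bytes
)
flagged <- flag_long_names(candidates)

# With reticulate, the input could come from Python instead:
# os <- reticulate::import("os")
# flag_long_names(os$listdir("test"))
```

If the result is non-empty, the output of dir for that folder should probably not be trusted, and the listing from Python can be used in its place.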