I tried to list the files in a directory with the dir function on Windows 11, but found that in some cases it silently fails to return all files, with no error or warning. On further investigation (see the code below for details), dir fails to return all files when the directory contains a file with a very long name (apparently more than 259 bytes). dir seems to stop searching when it reaches the file with the very long name, so the files listed after it are not found either, even if their names are short. Interestingly, even when dir fails to detect the file with the very long name, file.exists still detects its existence correctly.


ADD 2023.08.30 12:44 UTC

I ran the following code in R 4.3.1 on Windows 11. The same code crashes R 4.2.2.


#create test directory
dir.create("test",showWarnings = FALSE)

# step 0 ----------------------------------------------------------------------
#generate 01_test.txt, 02_test.txt, ...
for(i in c(1,2,3,11,12,13)){
    file.create(sprintf("test/%02d_test.txt",i))
}
# dir finds all six files.
dir("test")

# step 1 ----------------------------------------------------------------------
# use a Unicode character
#   on Windows, file names are encoded in UTF-8 (?)
#   \u3042 is a Japanese kana character.
longname = paste(c("test/05_test",rep("\u3042",times=10),".txt"),collapse = "")
file.create(longname)

# Again, dir finds all (now seven) files.
dir("test")

# step 2 ----------------------------------------------------------------------
# try a longer file name
# in UTF-8, \u3042 is 3 bytes, so "05_test" (7 bytes) + 83*3 bytes + ".txt" (4 bytes) = 260 bytes
morelongname = paste(c("test/05_test",rep("\u3042",times=83),".txt"),collapse = "")
file.create(morelongname)

# dir fails to find all (now eight) files.
#   dir seems to stop searching at the "morelongname" file,
#   i.e., 11_test.txt, 12_test.txt, 13_test.txt are also not found.
dir("test",full.names = TRUE)

# however, file.exists can detect its existence.
file.exists(morelongname)

# file.remove also works well 
file.remove(morelongname)

# step 3 ----------------------------------------------------------------------
# try a name that is one byte shorter
# "05_tes" (6 bytes) + 83*3 bytes + ".txt" (4 bytes) = 259 bytes
morelongname2 = paste(c("test/05_tes",rep("\u3042",times=83),".txt"),collapse = "")
file.create(morelongname2)

# works well, so 260 bytes seems to be the limit.
dir("test",full.names = TRUE)

I know that Windows limits file names to 260 characters, but that limit should apply to the number of characters, not the number of bytes.
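To make the character-versus-byte distinction concrete, here is a minimal sketch that measures the name from step 2 in three ways: characters, UTF-8 bytes, and UTF-16 code units (the encoding NTFS is said to use in the comments). The variable names are only illustrative, and the sketch does not touch the file system.

```r
# Build the same file name as "morelongname" in step 2, without the directory part.
name <- paste0("05_test", strrep("\u3042", 83), ".txt")

# Number of characters: 7 + 83 + 4
n_chars <- nchar(name, type = "chars")
# Number of bytes in UTF-8: 7 + 83 * 3 + 4
n_bytes <- nchar(name, type = "bytes")
# Number of UTF-16 code units: \u3042 is in the BMP, so one unit (2 bytes) each
n_utf16_units <- length(iconv(name, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]) / 2

print(c(chars = n_chars, utf8_bytes = n_bytes, utf16_units = n_utf16_units))
# chars = 94, utf8_bytes = 260, utf16_units = 94
```

Only the UTF-8 byte count reaches 260, which is consistent with the observation that the threshold follows bytes rather than characters.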

My questions are:

  1. Is this the expected, normal behavior of dir, or is it a bug?
  2. It is difficult to tell in advance from R that a folder contains a file with a long name. It is also difficult to notice that dir has failed, because it gives no error or warning even when the file name is extremely long. Do you have any idea how to detect or avoid this problem? In my case, it is fine to ignore the file with the very long name, because such files are very rare.

ADD 2023.08.31 01:51 UTC

I tried two solutions.

One is to use the system function with the intern option, which returns the output of the command as a character vector.

system("dir test",intern=TRUE) |> 
    stringr::str_split("\\s+") |> 
    purrr::flatten_chr()

It works partially, but fails to read Unicode characters.

[1] "01_test.txt"
[2] "02_test.txt"
[3] "03_test.txt"
[4] "05_test\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202.txt"
[5] "05_test\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201\\202\\343\\201"
[6] "11_test.txt"
[7] "12_test.txt"
[8] "13_test.txt"

\343\201\202 is the three-byte UTF-8 encoding of \u3042, so I guess this is an encoding problem, but I could not solve it.
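The doubled backslashes in the printed output suggest that the strings contain literal backslash-plus-octal escapes rather than raw bytes. Under that assumption, a minimal sketch of decoding them back to UTF-8 could look like this. `decode_octal_escapes` is a hypothetical helper, not part of any package, and it cannot recover a name whose escape sequence was cut off, as in entry [5].

```r
# Hypothetical helper: turn literal "\343\201\202"-style octal escapes into UTF-8 text.
# Assumes each escape is a backslash followed by exactly three octal digits.
decode_octal_escapes <- function(s) {
  m <- gregexpr("\\\\[0-7]{3}", s)
  esc <- regmatches(s, m)[[1]]
  if (length(esc) == 0) return(s)
  # Text between the escapes: always length(esc) + 1 pieces, possibly empty.
  rest <- regmatches(s, m, invert = TRUE)[[1]]
  # Drop the backslash and read the three digits as an octal number.
  bytes <- as.raw(strtoi(substring(esc, 2), base = 8L))
  out <- charToRaw(rest[1])
  for (k in seq_along(esc)) {
    out <- c(out, bytes[k], charToRaw(rest[k + 1]))
  }
  res <- rawToChar(out)
  Encoding(res) <- "UTF-8"
  res
}

decoded <- decode_octal_escapes("05_test\\343\\201\\202\\343\\201\\202.txt")
print(decoded == "05_test\u3042\u3042.txt")
```

Applied with `vapply()` over the output of `system(..., intern = TRUE)`, this would restore the complete names while leaving plain ASCII names unchanged.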


The other was suggested by @Jean-Claude Arbaut in the comments: calling Python through the reticulate package.

os = reticulate::import("os")
os$listdir("test")

Although there is a small overhead when importing the os module, it works well! At least in my case, this method is a likely solution.
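Since dir gives no error or warning, one way to detect the failure is to compare its result with an independent listing and warn when entries are missing. The following is a minimal sketch of that idea. `check_dir_listing` and its `reference_lister` argument are hypothetical names introduced here; the reference can be any function that returns the names in a directory, such as a wrapper around `os$listdir`.

```r
# Hypothetical helper: warn when dir() returns fewer entries than a reference listing.
check_dir_listing <- function(path, reference_lister) {
  # Include hidden files but not "." and "..", to match what os.listdir reports.
  from_dir <- dir(path, all.files = TRUE, no.. = TRUE)
  from_ref <- reference_lister(path)
  missed <- setdiff(from_ref, from_dir)
  if (length(missed) > 0) {
    warning(sprintf("dir() missed %d of %d entries in '%s'",
                    length(missed), length(from_ref), path))
  }
  invisible(missed)
}

# Self-contained demonstration with a fake reference that reports one extra name.
demo_dir <- file.path(tempdir(), "check_dir_listing_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, c("01_test.txt", "02_test.txt"))))
fake_lister <- function(p) c(dir(p), "ghost.txt")
missed <- suppressWarnings(check_dir_listing(demo_dir, fake_lister))
print(missed)
```

With reticulate installed, the real check would be something like `os <- reticulate::import("os"); check_dir_listing("test", function(p) os$listdir(p))`, which needs a working Python installation.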

hmIto
  • NTFS stores names in [UTF-16](https://stackoverflow.com/questions/2050973/what-encoding-are-filenames-in-ntfs-stored-as), not UTF-8, hence the path is actually shorter (anyway the [limit](https://learn.microsoft.com/en-us/windows/win32/fileio/maximum-file-path-limitation) is 255 for a path component). On Windows 10, with Python I can create files with names up to 255 characters using extended paths with \\?\, including UTF-16 chars (paths with greek α repeated for instance). However, with R 4.2.2 the dir command simply crashes R. And `ls` in MSYS2 bash fails too. – Jean-Claude Arbaut Aug 30 '23 at 11:36
  • On Windows 11 with R 4.3.1 and the same files, R does not crash but can only show files up to 129 characters. Python has no problem with `os.listdir()` though. There is a bug somewhere, but difficult to say where: Python is compiled with MSVC while R is compiled with MinGW-gcc, so there *might* be a problem in the R code or in a library used by R. – Jean-Claude Arbaut Aug 30 '23 at 11:44
  • It may be worth reading Tomas Kalibera's recent blog post: https://blog.r-project.org/2023/03/07/path-length-limit-on-windows/ – Mikael Jagan Aug 30 '23 at 12:11
  • @Jean-ClaudeArbaut Thank you for the comments. I misunderstood the file name encoding. However, I checked the relationship between the number of Unicode characters and ASCII characters, so I guess this problem is caused in UTF-8; at least the Unicode character \u3042 is counted as 3 bytes. About the crash: in my environment too, the crash occurs in R 4.2.2, but the code works without error in R 4.3.1. I added this information to the post. – hmIto Aug 30 '23 at 12:51
  • Note that Unicode character 0x3042 (HIRAGANA LETTER A) is in the [Basic Multilingual Plane](https://en.wikipedia.org/wiki/Plane_(Unicode)) (BMP), hence stored as one UTF-16 character. – Jean-Claude Arbaut Aug 30 '23 at 12:54
  • @MikaelJagan Very useful link. I have an objection to "Bash in Msys2 as well as cmd.exe and Powershell can work with long directories." though: for instance I get """$ ls ls: cannot access 'ααααααααααααααααααααααααααααααααααααααααααααααααααααααααααααααααααααααααααααααααααααααααααααααααααααααααααααααααααααααααααααααα'$'\316': No such file or directory.""" I have no idea what this $'\316' means. – Jean-Claude Arbaut Aug 30 '23 at 12:55
  • @MikaelJagan, thank you for the information. From your comment, I found a release note about [changes in R 4.3.0](https://cran.r-project.org/doc/manuals/r-release/NEWS.pdf): "R on Windows is now able to work with path names longer than 260 characters when these are enabled in the system (requires at least Windows 10 version 1607)." So the reason why the crash does not occur in R 4.3.1 may be related to this! However, the note talks about the number of characters rather than the number of bytes, and my total path length is less than 260 characters, so I think my code should have been supported even before this release... – hmIto Aug 30 '23 at 13:04
  • My own view is: if you're using file names of that kind of length, you need to rethink your naming convention. I had big enough problems accessing remote (inside-company)servers due to the absurdly long directory paths thanks to Windows' stupid limits on path lengths. Don't make it worse – Carl Witthoft Aug 30 '23 at 16:03
  • @CarlWitthoft I know very well that kind of problem. However, Windows has support for paths up to 32K chars. I have seen examples of deep directories for which shortening was awkward. Sticking to 260 is akin to saying we should rethink RAM use and stick to 640K because we had so many problems with himem. Users don't have to stick to bad habits. Instead applications must be upgraded. This is a bug, plain and simple. R not being able to deal with Unicode on Windows decades after Windows had it was also a bug (UCS2 in NT, UTF16 in win2k). Thankfully it improved with 4.2. – Jean-Claude Arbaut Aug 30 '23 at 16:53
  • @Jean-ClaudeArbaut My comment, referring to long file names, is IMHO still valid, because ridiculously long names start to lose their meaning, or at least make it really difficult to find the specific file wanted. – Carl Witthoft Aug 30 '23 at 21:06
  • @CarlWitthoft It's getting OT, but *sometimes*, in an enterprise environment, for files that are stored for a long time and used by changing teams, it's useful to have long descriptive names, especially to find a specific file. My 2¢ Now to return to the topic, if detecting this situation is critical, I'd suggest doing it in Python, as it seems to work well. Through `reticulate` or a call to `system()`. Maybe in C++, but if the problem lies deeper in MinGW it might not work. – Jean-Claude Arbaut Aug 30 '23 at 22:40
  • @CarlWitthoft I fully agree that overly long names are not only worthless but harmful. However, in my case, these problematic files come from outside. It is difficult to prevent the submission of a file with an overly long name, and once such a file ends up in my folder, other files with sufficiently short names are also not found by `dir`... – hmIto Aug 31 '23 at 01:06
  • @Jean-ClaudeArbaut Thank you for the suggestion. It works well! I updated my post with your solution! – hmIto Aug 31 '23 at 01:52

0 Answers