I want to check whether all the files in a given folder have portable names, or whether some have unfortunate names that may make it impossible to represent the same file structure on various file systems; I want to support at least the most common cases. For example, on Windows you cannot have a file called aux.txt, and file names are not case sensitive. This is my best attempt, but I'm not an expert in operating systems or file system design. Looking on Wikipedia, I've found 'incomplete' lists of possible problems... but... how can I catch all the issues? Please look at my code below and see if I've forgotten any subtle unfortunate case. In particular, I've found a lot of 'Windows issues'. Is there any Linux/Mac issue that I should check for?

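import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

class CheckFileSystemPortable {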
  Path top;
  List<Path> okPaths=new ArrayList<>();
  List<Path> badPaths=new ArrayList<>();
  List<Path> repeatedPaths=new ArrayList<>();

  CheckFileSystemPortable(Path top){
    assert Files.isDirectory(top);
    this.top=top;

    try (Stream<Path> walk = Files.walk(top)) {//the first one is guaranteed to be the root
      walk.skip(1).forEach(this::checkSystemIndependentPath);
    } catch (IOException e) {
      throw new Error(e);
    }

    for(var p:okPaths) {
      checkRepeatedPaths(p);
    }

    okPaths.removeAll(repeatedPaths);
  }

  private void checkRepeatedPaths(Path p) {
    var s=p.toString();
    for(var pi:okPaths){
      if (pi!=p && pi.toString().equalsIgnoreCase(s)) {
        repeatedPaths.add(pi);
      }
    }
  }

//incomplete list from wikipedia below:
//https://en.wikipedia.org/wiki/Filename#Reserved_characters_and_words
  private static final List<String>forbiddenWin=List.of(
    "CON", "PRN", "AUX", "CLOCK$", "NUL",
    "COM0", "COM1", "COM2", "COM3", "COM4", "COM5", "COM6", "COM7", "COM8", "COM9",
    "LPT0", "LPT1", "LPT2", "LPT3", "LPT4", "LPT5", "LPT6", "LPT7", "LPT8", "LPT9",
    "LST", "KEYBD$", "SCREEN$", "$IDLE$", "CONFIG$", 
    "$Mft", "$MftMirr", "$LogFile", "$Volume", "$AttrDef", "$Bitmap", "$Boot",
    "$BadClus", "$Secure", "$Upcase", "$Extend", "$Quota", "$ObjId", "$Reparse"
    );

  private void checkSystemIndependentPath(Path path) {
    String lastName=path.getName(path.getNameCount()-1).toString();
    String[] parts=lastName.split("\\.");

    var ko = forbiddenWin.stream()
        .filter(f -> Stream.of(parts).anyMatch(p->p.equalsIgnoreCase(f)))
        .count();

    if(ko!=0) {
      badPaths.add(path);
    } else {
      okPaths.add(path);
    }
  }
}
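
The case-insensitivity concern raised in the question can also be checked by grouping paths under a case-folded key instead of comparing every pair. The following is only a minimal sketch: the class name `CaseCollisions` and the method `find` are hypothetical, and lower-casing with `Locale.ROOT` is an approximation of how case-insensitive file systems compare names, not an exact model of any of them.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class CaseCollisions {
    /** Groups paths by a case-folded key and returns only the groups with more than one member. */
    static List<List<String>> find(List<String> paths) {
        Map<String, List<String>> byKey = new LinkedHashMap<>();
        for (String p : paths) {
            // Locale.ROOT keeps the folding independent of the machine's default locale.
            String key = p.toLowerCase(Locale.ROOT);
            byKey.computeIfAbsent(key, k -> new ArrayList<>()).add(p);
        }
        List<List<String>> collisions = new ArrayList<>();
        for (List<String> group : byKey.values()) {
            if (group.size() > 1) {
                collisions.add(group);
            }
        }
        return collisions;
    }

    public static void main(String[] args) {
        // prints [[src/Main.java, src/main.java]]
        System.out.println(find(List.of("src/Main.java", "src/main.java", "README")));
    }
}
```

Each path is visited once, so this avoids the nested loop over okPaths, at the cost of one map entry per distinct folded name.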
Oleh Dokuka
Marco Servetto
  • You have a long list for Windows, but there seems to be much less to check for Linux and Mac. See https://stackoverflow.com/a/31976060/9391162 which says that only the null byte and the "/" character are restricted, plus ":" on Mac. – AMK Aug 22 '21 at 05:08
  • You have to consider the characters that are invalid too. For Windows you have covered the whole list; I recommend reading https://en.wikipedia.org/wiki/Device_file and https://stackoverflow.com/questions/1976007/what-characters-are-forbidden-in-windows-and-linux-directory-names – Dinithi Aug 22 '21 at 07:14
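
The character restrictions mentioned in these comments could be checked with something like the sketch below. The class `ForbiddenChars` and the method `offending` are hypothetical names; the character set is the usual Windows list (`< > : " / \ | ? *` plus control characters below 32), which also covers NUL and "/" for Linux and ":" for Mac.

```java
import java.util.ArrayList;
import java.util.List;

class ForbiddenChars {
    // Printable characters that Windows rejects in file names.
    // '/' is also rejected on Linux, and ':' is the Mac case from the comment above.
    private static final String WINDOWS_RESERVED = "<>:\"/\\|?*";

    /** Returns the offending characters found in a single file name (not a whole path). */
    static List<Character> offending(String fileName) {
        List<Character> found = new ArrayList<>();
        for (char c : fileName.toCharArray()) {
            boolean control = c < 32; // includes NUL (0), tab and newline
            if (control || WINDOWS_RESERVED.indexOf(c) >= 0) {
                found.add(c);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // prints [:, ?]
        System.out.println(offending("a:b?.txt"));
    }
}
```

The check works on one name at a time, so it should be applied to each path component separately; applied to a whole path it would flag the separators themselves.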

3 Answers

If I understand your question correctly, then according to the Wikipedia Filename page, portable file names must:

  • Be POSIX compliant, e.g. use only ASCII alphanumeric characters, ., _ and -
  • Avoid Windows and DOS device names.
  • Avoid NTFS special names.
  • Avoid special characters, e.g. \, |, /, $ etc.
  • Avoid a trailing space or dot.
  • Avoid beginning with a -.
  • Stay within the maximum name length, e.g. 8-bit FAT allows at most 9 characters.
  • Include an extension where required: some systems expect a . followed by a 3-letter extension.

With all that in mind checkSystemIndependentPath could be simplified a bit, to cover most of those cases using a regex.

For example, the following accepts only POSIX file names and rejects special device names, NTFS special names, special characters, and names with a trailing space or dot:

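private void checkSystemIndependentPath(Path path){
    // (?i): Windows device names are reserved regardless of case, e.g. aux.txt
    String reserved = "(?i)^(CON|PRN|AUX|NUL|COM[1-9]|LPT[1-9])(\\..*)*$";
    // POSIX portable filename characters: letters, digits, '.', '_' and '-'
    String posix = "^[a-zA-Z0-9._-]+$";
    // a name must not end in whitespace or a dot
    String trailing = ".*[\\s.]$";
    // the 8-bit FAT limit from the list above; raise it for modern systems
    int nameLimit = 9;

    String fileName = path.getFileName().toString();

    if (fileName.matches(posix) &&
            !fileName.matches(reserved) &&
            !fileName.matches(trailing) &&
            fileName.length() <= nameLimit) {
        okPaths.add(path);
    } else {
        badPaths.add(path);
    }
}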

Note that the example is not tested and doesn't cover every edge case. For example, some systems ban dots in directory names, and some systems will complain about multiple dots in a file name.
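
To see which rule a given name breaks, the same ideas can be arranged so that each check reports its own reason. This is a sketch only: `PortableNameRules` and `violations` are hypothetical names, the reserved list is limited to the classic device names, and the 248 limit is an assumption taken from the comments on this answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class PortableNameRules {
    // (?i) because Windows treats device names case-insensitively.
    private static final Pattern RESERVED =
            Pattern.compile("(?i)^(CON|PRN|AUX|NUL|COM[1-9]|LPT[1-9])(\\..*)?$");
    // POSIX portable filename character set.
    private static final Pattern POSIX = Pattern.compile("^[A-Za-z0-9._-]+$");
    // Assumption: the limit discussed in the comments, not a universal constant.
    static final int NAME_LIMIT = 248;

    /** Returns a human-readable reason for every rule the name violates. */
    static List<String> violations(String name) {
        List<String> out = new ArrayList<>();
        if (!POSIX.matcher(name).matches()) out.add("non-portable character");
        if (RESERVED.matcher(name).matches()) out.add("reserved device name");
        if (name.endsWith(".") || name.endsWith(" ")) out.add("trailing dot or space");
        if (name.startsWith("-")) out.add("leading hyphen");
        if (name.length() > NAME_LIMIT) out.add("too long");
        return out;
    }

    public static void main(String[] args) {
        System.out.println(violations("aux.txt"));   // prints [reserved device name]
        System.out.println(violations("README.md")); // prints []
    }
}
```

Reporting reasons rather than a single boolean makes it easier to decide, rule by rule, which restrictions a project can afford to relax.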

razboy
  • Thanks a lot. - If I only allow alphanumeric characters, I automatically avoid 'special characters', right? - What is the danger in allowing '$'? It is a common part of file names in my setting; I may be able to remove it, but I would need a good reason for it. - I'm using 248 as the length limit. Is there any system in wide use today that uses 9? – Marco Servetto Aug 27 '21 at 05:29
  • 248 seems reasonable ... The smaller limits are for DOS, Amiga and MicroVAX, all archaeological relics. On some systems, using `$` will make a file hidden. On Linux systems `$` is a special character in bash, so it may affect bash scripts, but I haven't tried it. – razboy Aug 27 '21 at 05:48
  • Ah, and of course it wouldn't be a POSIX-compliant file name. POSIX guarantees portability. There is an in-depth discussion here as well: https://stackoverflow.com/questions/4814040/allowed-characters-in-filename . As per a comment there, POSIX `fully portable filenames` include only the characters `A–Z a–z 0–9 . _ -`. The level of portability depends on your requirements. – razboy Aug 27 '21 at 05:51
  • > The level of portability depends on your requirements. Of course, but I know very little about file systems, so I'm trying to understand the implications. In particular, '$' is used in *.class files in Java. Does this cause trouble in any context? Which systems will treat a $ file as hidden? – Marco Servetto Aug 27 '21 at 05:58
  • Ah, interesting. Do you have the option to package your app in a Java jar archive to make your life easier? For me, Maven and Gradle do it. I run my apps with `java -jar my-app.jar`. The destination OS should not be able to see what's in the package and get upset about special characters. – razboy Aug 27 '21 at 06:02
  • Thanks, I know that trick, but my situation is more involved than that: I'm building my own programming language, and since Java was using '$' I allowed '$' too, and now it has become kind of a standard to name certain files with $s for a specific meaning (nesting of renamed concepts). Also, the folder with all the source of my language may need to contain other files needed to run other languages inside mine. It currently supports jars and *.class, which in turn can load local *.dll or *.so if needed. I fear at this point I will have to accept '$' and hope for the best... – Marco Servetto Aug 27 '21 at 06:08
  • Thanks for explaining. It sounds like `$` should be ok for your use case. – razboy Aug 27 '21 at 06:15
  • It seems the `$` sign upsets some apps on older versions of Windows, e.g. XP. For example, VLC cannot see files and folders ending in `$`. But I don't think VLC would be able to process bytecode anyway. I am guessing there are other edge cases like this. https://forum.videolan.org/viewtopic.php?f=14&t=105578&sid=b1a65db65e4e4ac304023f7a1cf149d3 – razboy Aug 27 '21 at 06:19

Assuming your Windows forbidden list is correct, and adding ":" (Mac) and NUL (everywhere), use regex!

private static final List<String> FORBIDDEN_WINDOWS_NAMES = List.of(
        "CON", "PRN", "AUX", "CLOCK$", "NUL",
        "COM0", "COM1", "COM2", "COM3", "COM4", "COM5", "COM6", "COM7", "COM8", "COM9",
        "LPT0", "LPT1", "LPT2", "LPT3", "LPT4", "LPT5", "LPT6", "LPT7", "LPT8", "LPT9",
        "LST", "KEYBD$", "SCREEN$", "$IDLE$", "CONFIG$",
        "$Mft", "$MftMirr", "$LogFile", "$Volume", "$AttrDef", "$Bitmap", "$Boot",
        "$BadClus", "$Secure", "$Upcase", "$Extend", "$Quota", "$ObjId", "$Reparse"
); // you can add more

private static final String FORBIDDEN_CHARACTERS = "\0:"; // you can add more

// the second lookahead rejects a forbidden name as the last path component, with or without an extension
private static final String REGEX = "^(?i)(?!.*[" + FORBIDDEN_CHARACTERS + "])(?!(?:.*/)?(?:\\Q" +
        String.join("\\E|\\Q", FORBIDDEN_WINDOWS_NAMES) + "\\E)(?:\\.[^/]*)?$).*";

private static final Pattern ALLOWED_PATTERN = Pattern.compile(REGEX);

public static boolean isAllowed(String path) {
    return ALLOWED_PATTERN.matcher(path).matches();
}

FYI, the regex generated from the lists/chars as defined here is:

^(?i)(?!.*[<nul>:])(?!(?:.*/)?(?:\QCON\E|\QPRN\E|\QAUX\E|\QCLOCK$\E|\QNUL\E|\QCOM0\E|\QCOM1\E|\QCOM2\E|\QCOM3\E|\QCOM4\E|\QCOM5\E|\QCOM6\E|\QCOM7\E|\QCOM8\E|\QCOM9\E|\QLPT0\E|\QLPT1\E|\QLPT2\E|\QLPT3\E|\QLPT4\E|\QLPT5\E|\QLPT6\E|\QLPT7\E|\QLPT8\E|\QLPT9\E|\QLST\E|\QKEYBD$\E|\QSCREEN$\E|\Q$IDLE$\E|\QCONFIG$\E|\Q$Mft\E|\Q$MftMirr\E|\Q$LogFile\E|\Q$Volume\E|\Q$AttrDef\E|\Q$Bitmap\E|\Q$Boot\E|\Q$BadClus\E|\Q$Secure\E|\Q$Upcase\E|\Q$Extend\E|\Q$Quota\E|\Q$ObjId\E|\Q$Reparse\E)(?:\.[^/]*)?$).*

Each forbidden filename has been wrapped in \Q and \E, which is how you quote an expression in regex so that all chars are treated as literal chars. For example, the dollar sign in \Q$Boot\E doesn't mean end of input; it's just a plain dollar sign.
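
The effect of the quoting can be seen in isolation with a small experiment. The class `QuoteDemo` and its two helper methods are hypothetical names used only for this illustration; `Pattern.quote` is the standard library method that wraps a string in \Q and \E for you.

```java
import java.util.regex.Pattern;

class QuoteDemo {
    /** Uses the text as a regex as-is, so '$' keeps its special meaning. */
    static boolean rawMatches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    /** Quotes the text first, so every character is matched literally. */
    static boolean quotedMatches(String literal, String input) {
        return Pattern.compile(Pattern.quote(literal)).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Unquoted, the leading $ is an end-of-input anchor, so nothing can follow it.
        System.out.println(rawMatches("$Boot", "$Boot"));          // prints false
        // Quoted by hand with \Q...\E.
        System.out.println(rawMatches("\\Q$Boot\\E", "$Boot"));    // prints true
        // Quoted with Pattern.quote.
        System.out.println(quotedMatches("$Boot", "$Boot"));       // prints true
    }
}
```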

Bohemian
  • Your regex is very complicated... it must be doing so much more than just avoiding '\0' and ':'. Can you explain better what it is doing? – Marco Servetto Aug 27 '21 at 05:11
  • It avoids nul and colon and all the windows forbidden filenames as the file name. It may need tweaking, but it should basically work. – Bohemian Aug 27 '21 at 05:12

Thanks everyone. I have now written the complete code for this, and I'm sharing it as a potential answer, since I think the trade-offs I had to make are likely quite common. Main points:

  • I had to choose 248 as the maximum name length.
  • I had to accept '$' in file names.
  • I had to completely skip any file/folder/subtree that is either labelled as hidden (Windows) or starts with '.'; those files are hidden and likely to be autogenerated, out of my control, and in any case not used by my application.
  • Of course if your application relies on ".**" files/folders, you may have to check for those.
  • Another point of friction is multiple dots: not only may some systems be upset, but it is also unclear where the main name ends and the extension starts. For example, I had a use case with the file derby-10.15.2.0.jar inside. Is the extension .jar or .15.2.0.jar? Do some systems disagree on this? For now, I force such files to be renamed, for example to derby-10_15_2_0.jar
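
The two readings of the extension in the derby example can be made concrete with a short sketch. `ExtensionSplit` and its methods are hypothetical names; the first-dot variant mirrors the `indexOf(".")` split used in the code of this answer, while the last-dot variant is the more common convention.

```java
class ExtensionSplit {
    /** Extension after the LAST dot; a name with no dot, or only a leading dot, has none. */
    static String afterLastDot(String name) {
        int i = name.lastIndexOf('.');
        return i <= 0 ? "" : name.substring(i + 1);
    }

    /** Extension after the FIRST dot, i.e. everything following the main name. */
    static String afterFirstDot(String name) {
        int i = name.indexOf('.');
        return i == -1 ? "" : name.substring(i + 1);
    }

    public static void main(String[] args) {
        String name = "derby-10.15.2.0.jar";
        System.out.println(afterLastDot(name));  // prints jar
        System.out.println(afterFirstDot(name)); // prints 15.2.0.jar
    }
}
```

Because the two conventions disagree on this name, rejecting multi-dot names outright sidesteps the question of which one a given tool uses.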

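import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Stream;

public class CheckFileSystemPortable{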
  Path top;
  List<Path> okPaths = new ArrayList<>();
  List<Path> badPaths = new ArrayList<>();
  List<Path> repeatedPaths = new ArrayList<>();
  //placeholder: add a makeError(..) method here with anything you need for a good message
  public boolean isDirectory(Path top){ return Files.isDirectory(top); }
  //I override the above when I do mocks for testing

  public CheckFileSystemPortable(Path top){
    assert isDirectory(top);
    this.top = top;
    walkIn1(top);
    for(var p:okPaths){ checkRepeatedPaths(p); }
    okPaths.removeAll(repeatedPaths);
    }
  public void walkIn1(Path path) {
    try(Stream<Path> walk = Files.walk(path,1)){
      //the first one is guaranteed to be the root
      walk.skip(1).forEach(this::checkSystemIndependentPath);
      }
   catch(IOException e){ throw new Error(e);/*unreachable*/ }
   }
 private void checkRepeatedPaths(Path p){
   var s = p.toString();
   for(var pi:okPaths){
     if (pi!=p && pi.toString().equalsIgnoreCase(s)) {repeatedPaths.add(pi);}
     }
   }
 private static final List<String>forbiddenWin = List.of(
   "CON", "PRN", "AUX", "CLOCK$", "NUL",
   "COM0", "COM1", "COM2", "COM3", "COM4", "COM5", "COM6", "COM7", "COM8", "COM9",
   "LPT0", "LPT1", "LPT2", "LPT3", "LPT4", "LPT5", "LPT6", "LPT7", "LPT8", "LPT9",
   "LST", "KEYBD$", "SCREEN$", "$IDLE$", "CONFIG$", 
   "$Mft", "$MftMirr", "$LogFile", "$Volume", "$AttrDef", "$Bitmap", "$Boot",
   "$BadClus", "$Secure", "$Upcase", "$Extend", "$Quota", "$ObjId", "$Reparse",
   ""
   );
 static final Pattern regex = Pattern.compile(//POSIX + $,
   "^[a-zA-Z0-9\\_\\-\\$]+$");// but . is handled separately
 public void checkSystemIndependentPath(Path path){
   String lastName=path.getFileName().toString();
   //too dangerous even for ignored ones
   if(lastName.equals(".") || lastName.equals("..")) { badPaths.add(path); return; }
   boolean skip = path.toFile().isHidden() || lastName.startsWith(".");
   if(skip){ return; }
   var badSizeEndStart = lastName.length()>248 
     ||lastName.endsWith(".") 
     ||lastName.endsWith("-") 
     || lastName.startsWith("-");
   if(badSizeEndStart){ badPaths.add(path); return; }
   var i=lastName.indexOf(".");
   var fileName = i==-1?lastName:lastName.substring(0,i);
   var extension = i==-1?"":lastName.substring(i+1);
   var extensionDots = extension.contains(".");
   if(extensionDots){ badPaths.add(path); return; }
   var badDir = isDirectory(path) && i!=-1;
   if(badDir){ badPaths.add(path); return; }
   var badFileName = !regex.matcher(fileName).matches();
   var badExtension = !extension.isEmpty() && !regex.matcher(extension).matches();
   if(badFileName||badExtension){ badPaths.add(path); return; }
   var ko = forbiddenWin.stream()
    .filter(f->fileName.equalsIgnoreCase(f)).count();
   if(ko!=0){ badPaths.add(path); return; }
   okPaths.add(path);
   walkIn1(path);//recursive exploration
   }
 }
Marco Servetto