20

What is the most correct regular expression (regex) for a UNIX file path?

For example, to detect something like this:

/usr/lib/libgccpp.so.1.0.2

It's pretty easy to write a regular expression that matches most file paths, but what's the best one: one that also handles escaped whitespace sequences and unusual characters you don't usually find in UNIX file paths?

Also, are there library functions in several different programming languages that provide a file path regex?

Chad Birch
Neil
  • "escaped whitespace sequences"? Using what escape syntax? UNIX paths have no such escapes. sh/ksh/bash have a mostly common escape syntax, URL's have another, Perl yet another. – Darron Feb 11 '09 at 21:27

6 Answers

15

The proper regular expression to match all UNIX paths is: [^\0]+

That is, one or more characters that are not a NUL.
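As a minimal sketch of how that pattern could be applied in Python (the names `UNIX_PATH_RE` and `is_valid_unix_path` are hypothetical, not from any library), a full match against the string is enough. It checks only the character rule stated above, not length limits or whether the path exists:

```python
import re

# One or more characters, none of which is NUL.
# A negated character class also matches newlines, so no flags are needed.
UNIX_PATH_RE = re.compile(r"[^\x00]+")

def is_valid_unix_path(path: str) -> bool:
    """Return True if the whole string is non-empty and contains no NUL."""
    return UNIX_PATH_RE.fullmatch(path) is not None
```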

Darron
14

If you don't mind false positives for identifying paths, then you really just need to ensure the path doesn't contain a NUL character; everything else is permitted (in particular, / is the name-separator character). The better approach would be to resolve the given path using the appropriate file IO function (e.g. File.exists(), File.getCanonicalFile() in Java).
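For comparison, here is a rough Python sketch of the "resolve it with file I/O" approach. It assumes `os.path.exists` and `os.path.realpath` as the closest standard-library counterparts of the Java calls named above; the helper name `check_path` is hypothetical:

```python
import os
from typing import Optional, Tuple

def check_path(path: str) -> Tuple[bool, Optional[str]]:
    """Return (exists, resolved_path); resolved_path is None for invalid input."""
    # NUL can never appear in a path, so reject it before touching the filesystem.
    if not path or "\0" in path:
        return (False, None)
    # realpath resolves symlinks and '..' segments; it does not require
    # the path to exist, so the existence check is reported separately.
    return (os.path.exists(path), os.path.realpath(path))
```

Asking the operating system this way sidesteps the file-system-specific rules discussed in the long answer, at the cost of needing access to the machine the path refers to.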

Long answer:

This is both operating system and file system dependent. For example, the Wikipedia comparison of file systems notes that besides the limits imposed by the file system,

MS-DOS, Microsoft Windows, and OS/2 disallow the characters \ / : ? * " > < | and NUL in file and directory names across all filesystems. Unices and Linux disallow the characters / and NUL in file and directory names across all filesystems.

In Windows, the following reserved device names are also not permitted as filenames:

CON, PRN, AUX, NUL, COM1, COM2, COM3, COM4, COM5,
COM6, COM7, COM8, COM9, LPT1, LPT2, LPT3, LPT4, 
LPT5, LPT6, LPT7, LPT8, LPT9
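A small Python sketch of a check against the reserved names listed above. It assumes the comparison ignores case and any extension, which is the behaviour described for these device names; `WINDOWS_RESERVED_NAMES` and `is_reserved_windows_name` are hypothetical names:

```python
# Reserved device names from the list above.
WINDOWS_RESERVED_NAMES = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)

def is_reserved_windows_name(filename: str) -> bool:
    """Return True if the name, ignoring case and extension, is reserved."""
    # Assumption: everything after the first dot is treated as the extension.
    stem = filename.split(".", 1)[0]
    return stem.upper() in WINDOWS_RESERVED_NAMES
```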
Zach Scrivena
  • Additional: because of the variety between file systems, there are methods that get you the information you need. – Robert P Feb 11 '09 at 17:21
  • Those Win special devices are even worse than you think. I once renamed a C header from const.h to con.h and the compiler seemed to hang. Took a while to figure out it was reading the header file from the console because Win ignored the extension. Caveat: this may have been DOS, it was a long time ago. – paxdiablo Jul 10 '12 at 16:36
  • Useful information, but I don't understand why this non-answer is accepted...? – Slbox Apr 01 '20 at 23:45
11

To others who have answered this question: some applications require a slightly different regex, depending on how escape characters work in the program you're writing. If you were writing a shell, for example, and wanted commands to be separated by spaces and other special characters, you would have to modify your regex so that it accepts special characters within a word only when they are escaped.

So, for example, a valid path would be

  /usr/bin/program\ with\ space 

as opposed to

  /usr/bin/program with space 

which would refer to "/usr/bin/program" with the arguments "with" and "space".

A regex for the above example could be "([^\0 ]\|\\ )*"

The regex that I've been working on is (newline separated for 'readability'):

  "\(                    # Either
       [^\0 !$`&*()+]    # A normal (non-special) character
     \|                  # Or
       \\\(\ |\!|\$|\`|\&|\*|\(|\)|\+\)   # An escaped special character
   \)\+"                   # Repeated >= 1 times

Which translates to

  "\([^\0 !$`&*()+]\|\\\(\ |\!|\$|\`|\&|\*|\(|\)|\+\)\)\+"

Creating your own specific regex should be relatively simple, as well.
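As an illustration, the same idea can be sketched in Python's `re` syntax, where grouping and alternation are written `(?:...)` and `|` rather than `\(...\)` and `\|`. The names `SHELL_WORD_RE` and `first_word` are hypothetical, and the set of special characters is the one used above:

```python
import re

# Either a backslash followed by one special character,
# or a normal character (not NUL and not special); repeated >= 1 times.
# The escaped alternative comes first: Python tries alternatives left to
# right, so otherwise the backslash would be consumed as a normal character.
SHELL_WORD_RE = re.compile(r"(?:\\[ !$`&*()+]|[^\x00 !$`&*()+])+")

def first_word(command_line: str) -> str:
    """Return the leading path-like word, or '' if there is none."""
    m = SHELL_WORD_RE.match(command_line)
    return m.group(0) if m else ""
```

With this sketch, `first_word(r"/usr/bin/program\ with\ space")` returns the whole string, while `first_word("/usr/bin/program with space")` returns only `/usr/bin/program`.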

steventrouble
  • As an alternative to enumerating all of the escaped characters, you can simply make a group that consists of the escape followed by the class of escaped characters `([^ !$\`&*()+]|(\\[ !$\`&*()+]))+` – Rob Hall Mar 02 '16 at 20:38
7

^(/)?([^/\0]+(/)?)+$

This accepts every path that is legal on filesystems such as extX and reiserfs.

It rejects only path names that contain NUL or two or more consecutive slashes. Everything else should be legal according to the Unix spec (I'm surprised by this outcome too).
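To make the behaviour concrete, here is a Python sketch that compiles this pattern and, next to it, a plain string check intended to make the same accept/reject decision without nested quantifiers. Both function names are hypothetical:

```python
import re

# The pattern from this answer, as written, for Python's re module.
PATH_RE = re.compile(r"^(/)?([^/\0]+(/)?)+$")

def matches_answer_regex(path: str) -> bool:
    """Apply the answer's regex to the whole string."""
    return PATH_RE.match(path) is not None

def no_nul_no_double_slash(path: str) -> bool:
    """Same decision without a regex: at least one name component,
    no NUL, and no run of two or more slashes."""
    return path not in ("", "/") and "\0" not in path and "//" not in path
```

The regex nests a `+` inside another `+`, so the string check may be the safer choice for long or untrusted input.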

Danubian Sailor
  • Double slashes are perfectly fine in Unix paths, both in POSIX and in practice, so your regex is incorrect. The only character (or rather, octet) not allowed in Unix pathnames is \0. – Remember Monica Dec 28 '12 at 08:08
  • @RememberMonica are you saying a path like `/var///test/file.txt` is valid? – Slbox Apr 01 '20 at 23:49
  • @Slbox Yes that's a perfectly valid file path. `/var///test/file.txt` and `/var/test/file.txt` are equivalent. This convention makes some file path operations simpler. E.g. `userProvidedPath + "/filename.txt"` works whether `userProvidedPath` contains a trailing slash or not. – Scindix May 02 '20 at 15:59
  • Note that a variant of this regexp has proven to be susceptible to catastrophic backtracking for us, at least on Ruby-embedded Oniguruma, if the input string contains multiple forward slashes following each other. Something to keep in mind. – Julik Dec 25 '20 at 23:33
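The point made in the comments above about repeated slashes can be illustrated with a short Python sketch. It uses `posixpath.normpath` from the standard library to collapse the extra slashes for display; the variable names and the sample path are hypothetical:

```python
import posixpath

user_provided_path = "/var/test/"            # hypothetical input with a trailing slash
joined = user_provided_path + "/file.txt"    # naive concatenation keeps both slashes
normalized = posixpath.normpath(joined)      # purely lexical cleanup, no filesystem access

# Both spellings name the same file; normalizing only tidies the string.
same_target = posixpath.normpath("/var///test/file.txt") == normalized
```

Note that `normpath` also rewrites `..` segments lexically, so it is best treated as a display aid rather than a validity check.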
4

I'm not sure how common a regex check for this is across systems, but most programming languages (especially the cross-platform ones) provide a "file exists" check that takes this kind of thing into account.

Out of curiosity, where are these paths being input? Could you control that to a greater degree, to the point where you won't have to check the individual pieces of the path? For example, by using a file chooser dialog?

greg
1

Question already answered here: https://stackoverflow.com/a/42036026/1951947

Community
raythurnevoid