38

I'm trying to build a regular expression that will detect any character that Windows does not accept as part of a file name (are these the same for other OS? I don't know, to be honest).

These symbols are:

 \ / : * ? " < > |

Anyway, this is what I have: [\\/:*?\"<>|]

The tester over at http://gskinner.com/RegExr/ shows this to be working. For the string Allo*ha, the * symbol lights up, signalling it's been found. Should I enter Allo**ha however, only the first * will light up. So I think I need to modify this regex to find all appearances of the mentioned characters, but I'm not sure.

You see, in Java, I'm lucky enough to have the function String.replaceAll(String regex, String replacement). The description says:

Replaces each substring of this string that matches the given regular expression with the given replacement.

So in other words, even if the regex only finds the first and then stops searching, this function will still find them all.

For instance: String.replaceAll("[\\/:*?\"<>|]","")

However, I don't feel like I can take that risk. So does anybody know how I can extend this?

Svante
KdgDev
  • -1 make this a question and tell us the language or context you are using and I will give you your vote back – ojblass Apr 16 '09 at 00:33
  • I would also like to know what language you're using. – Kredns Apr 16 '09 at 00:33
  • 3
    Be aware that, because your regex is in the form of a Java string literal, you have to double-escape backslashes: "[\\\\/:*?\"<>|]". The way you had it, you were just escaping the forward-slash (which isn't necessary, but it's not an error either). – Alan Moore Apr 16 '09 at 06:17
  • One more thing: If you're trying to create regexes that will work in Java's native regex support, you should use a tester that's powered by Java, like this one: http://www.fileformat.info/tool/regex.htm (RegExr uses ActionScript's regex engine.) – Alan Moore Apr 16 '09 at 06:28
  • You can also try various String.replaceAll() in series like this: YourString.replaceAll("[^A-Za-z0-9_.\\s-" + File.separator + "]*", "").replaceAll("^\\s", "").replaceAll("\\s$", ""); – Luis Aug 15 '12 at 05:33
  • What's the best regular expression that will allow as many supported characters as possible , on Linux (or more precisely, on Android) ? – android developer Apr 22 '15 at 22:33
  • See also http://stackoverflow.com/questions/1155107/is-there-a-cross-platform-java-method-to-remove-filename-special-chars – Vadzim Jun 22 '15 at 15:55

12 Answers

22

Since no answer was good enough, I did it myself. Hope this helps ;)

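public static boolean validateFileName(String fileName) {
    // The first character is mandatory and may be neither '.' nor a reserved
    // character, so an empty name fails here instead of reaching the
    // exception thrown by getValidFileName.
    return fileName.matches("^[^.\\\\/:*?\"<>|][^\\\\/:*?\"<>|]*")
    && getValidFileName(fileName).length()>0;
}

public static String getValidFileName(String fileName) {
    // replaceAll is needed for the leading dots: replace() treats its argument as literal text.
    String newFileName = fileName.replaceAll("^\\.+", "").replaceAll("[\\\\/:*?\"<>|]", "");
    if(newFileName.length()==0)
        throw new IllegalStateException(
                "File Name " + fileName + " results in an empty fileName!");
    return newFileName;
}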
Alex_M
  • 4
    This does not remove all invalid characters. You've left out special characters, for example. – Ray Nicholus Jun 21 '12 at 02:17
  • 2
    Doesn't the ^ prevent this matching special characters except at the start of the file name? I used fileName.replace("^\\.+", "").replaceAll("[\\\\/:*?\"<>|]", "") – Oliver Bock Jun 11 '14 at 01:36
  • 1
    If you try to use any replacement character other than "", the regex in the answer will fail. Oliver Bock's regex works fine. – Markus Jun 12 '14 at 13:39
  • what about `trim`ming to avoid leading or ending space – TOPKAT Dec 09 '18 at 18:22
18

Windows filename rules are tricky. You're only scratching the surface.

For example, here are some things that are not valid filenames, in addition to the characters you listed:

                                    (yes, that's an empty string)
.
.a
a.
 a                                  (that's a leading space)
a                                   (or a trailing space)
com
prn.txt
[anything over 240 characters]
[any control characters]
[any non-ASCII characters that don't fit in the system codepage,
 if the filesystem is FAT32]

Removing special characters in a single regex sub like String.replaceAll() isn't enough; you can easily end up with something invalid like an empty string or trailing ‘.’ or ‘ ’. Replacing something like “[^A-Za-z0-9_.]+” with ‘_’ would be a better first step. But you will still need higher-level processing on whatever platform you're using.
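
As a rough illustration of the "higher-level processing" mentioned above, here is a minimal sketch in Java. The class name WindowsFileNames, the method isProbablyValid and the reserved-name set are my own assumptions, not part of the answer; the 240-character limit and the leading/trailing dot and space rules simply follow the list above (the comments below dispute some of them). It does not attempt the codepage check for FAT32, and it needs Java 9+ for Set.of.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: a rough validity check for a single Windows
// file name (not a full path). A sketch, not a complete rule set.
final class WindowsFileNames {
    // Assumed set of reserved device names; a name stays reserved
    // even with an extension, e.g. "prn.txt".
    private static final Set<String> RESERVED = Set.of(
            "CON", "PRN", "AUX", "NUL",
            "COM1", "COM2", "COM3", "COM4", "COM5", "COM6", "COM7", "COM8", "COM9",
            "LPT1", "LPT2", "LPT3", "LPT4", "LPT5", "LPT6", "LPT7", "LPT8", "LPT9");

    // Limit taken from the list above; adjust it for your environment.
    private static final int MAX_LENGTH = 240;

    // The nine punctuation characters from the question.
    private static final String FORBIDDEN = "\\/:*?\"<>|";

    static boolean isProbablyValid(String name) {
        if (name.isEmpty() || name.length() > MAX_LENGTH) return false;

        // Leading or trailing space or dot, as in the list above.
        char first = name.charAt(0);
        char last = name.charAt(name.length() - 1);
        if (first == ' ' || first == '.' || last == ' ' || last == '.') return false;

        // Control characters and reserved punctuation anywhere in the name.
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (c < 0x20 || FORBIDDEN.indexOf(c) >= 0) return false;
        }

        // Compare the part before the first dot against the device names.
        int dot = name.indexOf('.');
        String base = (dot < 0 ? name : name.substring(0, dot)).toUpperCase(Locale.ROOT);
        return !RESERVED.contains(base);
    }
}
```

A check like this would run after (or instead of) the regex replacement, so that names such as "prn.txt" or "a." are caught even though they contain no forbidden character.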

bobince
  • Windows filename rules are indeed tricky. No one (not even Microsoft) has written a fully correct set of rules. I haven't either. But I can tell you "." is legal (that directory always exists), and ".a" and "a." and com and >240 characters etc. can be created by escaping the names perfectly legally. – Windows programmer Apr 16 '09 at 02:19
  • Well ‘.’ (and ‘..’) are legal pathnames, but you can't use them as filenames, obviously! How do you ‘escape’ leading/trailing dots and reserved filenames? I can't see any public interface that allows it; both the UI and the file IO interface rename the dots and disallow the reserved name. – bobince Apr 16 '09 at 09:04
  • (I can create the long pathnames by renaming and moving, but it causes Explorer and many other applications to be unstable when accessing them, which is why it's undesirable.) – bobince Apr 16 '09 at 09:05
  • 1
    "copy con \\.\d:\.a" (without the quotes), Enter key to start, Ctrl+Z to stop. File d:\.a exists just fine. Well, fortunately someone accepted your answer so lots of future readers can be misled too. – Windows programmer Apr 17 '09 at 01:20
  • "copy con \\.\d:\con" and you get to use con with both meanings. By the way this assumes drive d is a disk; if it isn't then say drive c or something else. – Windows programmer Apr 17 '09 at 01:23
6

I use a plain and simple regular expression. I list the characters that are allowed and, by negating the class with "^", replace every other character with a chosen symbol such as "_".

String fileName = someString.replaceAll("[^a-zA-Z0-9\\.\\-]", "_");

For example, if you do not want "." to be allowed, remove the "\\." from the expression:

String fileName = someString.replaceAll("[^a-zA-Z0-9\\-]", "_");

Adam111p
2

The required regex / syntax (JS):

.trim().replace(/[\\/:*?\"<>|]/g,"").substring(0,240);

The last part is optional; use it only when you want to limit the length to 240 characters.

other useful functions (JS):

.toUpperCase();
.toLowerCase();
.replace(/ {2,}/g,' ');  //normalising multiple spaces to one, add before substring.
.includes("str");        //check if a string segment is included in the filename
.split(".").slice(-1);   //get extension, given the entire filename contains a .
Jason
2

For the record, POSIX-compliant systems (including UNIX and Linux) support all characters except the null character ('\0') and forward slash ('/') in filenames. Special characters such as space and asterisk must be escaped on the command line so that they do not take their usual roles.
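
The rule above is small enough to check without a regex. A minimal sketch in Java, where the class and method names are made up for illustration; rejecting the empty string, "." and ".." is my own addition, and length limits are not checked:

```java
// Hypothetical helper: checks one path component (a file name, not a path)
// against the POSIX rule described above.
final class PosixFileNames {
    static boolean isValidComponent(String name) {
        // Added assumption: an empty name, "." and ".." cannot name a new file.
        if (name.isEmpty() || name.equals(".") || name.equals("..")) return false;
        // Only the null character and the forward slash are forbidden.
        return name.indexOf('\0') < 0 && name.indexOf('/') < 0;
    }
}
```

Note that a name like "a*b c" passes here even though it would need quoting in a shell.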

Artelius
1

I keep only the word characters and whitespace characters from the original string, and I also make sure that no whitespace is left at the end of the string. Here is my code snippet in Java.

// "|" inside a character class is a literal pipe, so it is left out here
String temp_string = original.replaceAll("[^\\w\\s]", "");
// "+" removes all trailing whitespace, not just one character
String final_string = temp_string.replaceAll("\\s+$", "");

I think I helped someone.

Vysakh Prem
1

Java has a replaceAll function, but every programming language has a way to do something similar. Perl, for example, uses the g switch to signify a global replacement. Python's sub function allows you to specify the number of replacements to make. If, for some reason, your language doesn't have an equivalent, you can always do something like this (shown here in Java):

Pattern bad = Pattern.compile(bad_characters);        // java.util.regex.Pattern
while (bad.matcher(filename).find())                  // find() looks anywhere in the string
  filename = bad.matcher(filename).replaceFirst("");  // strings are immutable, so reassign
Pesto
0

I made one very simple method that works for me for most common cases:

// replace special characters that Windows doesn't accept
private String replaceSpecialCharacters(String string) {
    return string.replaceAll("[\\*/\\\\!\\|:?<>]", "_")
            .replaceAll("(%22)", "_");
}

%22 is the URL-encoded form of a double quote ("), in case your file names contain one.

Ivan Aracki
0

You cannot do this with a single regexp match, because a regexp always matches a substring of the input. Consider the word Alo*h*a: there is no substring that contains all the *s and no other character. So if you can use the replaceAll function, just stick with it.

BTW, the set of forbidden characters is different in other OSes.
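
To see that replaceAll really handles every occurrence, here is a small sketch in Java. The class name ReplaceAllDemo and its methods are made up for illustration. The pattern uses four backslashes in the string literal so that the character class actually contains a backslash.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: one match covers one character, but
// replaceAll (or a find() loop) visits every match in turn.
final class ReplaceAllDemo {
    // "\\\\" in the Java literal is "\\" in the regex, i.e. one literal backslash.
    static final Pattern FORBIDDEN = Pattern.compile("[\\\\/:*?\"<>|]");

    // Removes every forbidden character, wherever it occurs.
    static String strip(String name) {
        return FORBIDDEN.matcher(name).replaceAll("");
    }

    // Counts the separate matches found in the input.
    static int countForbidden(String name) {
        Matcher m = FORBIDDEN.matcher(name);
        int count = 0;
        while (m.find()) count++;
        return count;
    }

    public static void main(String[] args) {
        System.out.println(strip("Alo*h*a"));          // prints Aloha
        System.out.println(countForbidden("Alo*h*a")); // prints 2
    }
}
```

Each call to find() matches a single character, which is why a tester may highlight only the first hit, while replaceAll repeats the search until the end of the input.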

jpalecek
  • I'm not sure I understand what you're saying, but you can definitely match invalid filenames with a regex. – wilhelmtell Apr 16 '09 at 00:46
  • Yes, but you cannot sanitize invalid filenames by replacing a single occurrence of a regex without lots of collateral damage – jpalecek Apr 16 '09 at 10:25
0

I was in the same situation: I wanted to name files directly from a script whose text contained a wide variety of special characters. The approach I came up with in Python was to do something like

re.sub(r"[^]\w\s`,!@#$&%_^\-)}{\['.(]", "_", text)

Java equivalent would be:

text.replaceAll("[^\\]\\w\\s`,!@#$&%_^\\-)}{\\['.(]", "_")

Note: I'm using Windows 11 and it supports , ! @ # $ % ^ & ` '

@Balaco mentioned that it doesn't support %, I'm not sure which version, so please do try naming files with special characters in your system to figure out the rules

-1

Windows also does not accept "%" in a file name.

If you are building a general expression for files that may eventually be moved to another operating system, I suggest that you also exclude other characters that may cause problems there.

For example, in Linux (in the many distributions I know), some users may have problems with files containing & ! ] [ / - ( ). The symbols are allowed in file names, but they may need special treatment by users, and some programs have bugs triggered by their presence.

Balaco
-1

You might try allowing only the stuff you want the user to be able to enter, for example A-Z, a-z, and 0-9.
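
A minimal sketch of that allow-list idea in Java; the class and method names are hypothetical, and the set of allowed characters is just the one named above:

```java
// Hypothetical helper: accepts a name only if every character is on the allow-list.
final class AllowListCheck {
    static boolean isAllowed(String name) {
        // matches() tests the whole string; "+" also rejects the empty string.
        return name.matches("[A-Za-z0-9]+");
    }
}
```

In practice you would probably extend the class with a few more characters (for example "." or "_") once you have decided they are safe on your target systems.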

Kredns