140

I'm receiving a string from an external process. I want to use that String to make a filename, and then write to that file. Here's my code snippet to do this:

    String s = ... // comes from external source
    File currentFile = new File(System.getProperty("user.home"), s);
    PrintWriter currentWriter = new PrintWriter(currentFile);

If s contains an invalid character, such as '/' in a Unix-based OS, then a java.io.FileNotFoundException is (rightly) thrown.

How can I safely encode the String so that it can be used as a filename?

Edit: What I'm hoping for is an API call that does this for me.

I can do this:

    String s = ... // comes from external source
    File currentFile = new File(System.getProperty("user.home"), URLEncoder.encode(s, "UTF-8"));
    PrintWriter currentWriter = new PrintWriter(currentFile);

But I'm not sure whether URLEncoder is reliable for this purpose.

Steve McLeod

11 Answers

124

My suggestion is to take a "white list" approach, meaning don't try and filter out bad characters. Instead define what is OK. You can either reject the filename or filter it. If you want to filter it:

String name = s.replaceAll("\\W+", "");

This replaces every character that isn't a letter, digit or underscore with nothing. Alternatively, you could replace them with another character (such as an underscore).

The problem is that if this is a shared directory then you don't want file name collision. Even if user storage areas are segregated by user you may end up with a colliding filename just by filtering out bad characters. The name a user put in is often useful if they ever want to download it too.

For this reason I tend to allow the user to enter what they want, store the filename based on a scheme of my own choosing (eg userId_fileId) and then store the user's filename in a database table. That way you can display it back to the user, store things how you want and you don't compromise security or wipe out other files.
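The scheme described above could be sketched roughly as follows. This is only an illustration: the class name StoredNameRegistry, its methods, and the in-memory map standing in for the database table are all assumptions, not part of the original answer.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

/** Sketch: store files under generated names and keep the user's name separately. */
class StoredNameRegistry {
    // Hypothetical stand-in for a database table: generated name -> name the user typed.
    private final Map<String, String> displayNames = new HashMap<>();
    private final AtomicLong nextFileId = new AtomicLong(1);

    /** Returns an on-disk name of the form userId_fileId and records the original name. */
    String register(long userId, String userSuppliedName) {
        String storedName = userId + "_" + nextFileId.getAndIncrement();
        displayNames.put(storedName, userSuppliedName);
        return storedName;
    }

    /** Looks up the name to show back to the user, or null if it is unknown. */
    String displayNameOf(String storedName) {
        return displayNames.get(storedName);
    }
}
```

Because the on-disk name never contains user input, two uploads with the same user-supplied name get different files, and a name such as "../etc/passwd" is only ever stored as data.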

You can also hash the file (eg MD5 hash) but then you can't list the files the user put in (not with a meaningful name anyway).

EDIT:Fixed regex for java

alianos-
cletus
  • I don't think it's a good idea to provide the bad solution first. In addition, MD5 is a nearly cracked hash algorithm. I recommend at least SHA-1 or better. – vog Jul 26 '09 at 10:12
  • 21
    For the purposes of creating a unique filename who cares if the algorithm is "broken"? – cletus Jul 26 '09 at 11:07
  • 5
    @cletus: the problem is that different strings will map to the same filename; i.e. collision. – Stephen C Jul 26 '09 at 11:19
  • 3
    A collision would have to be deliberate, the original question doesn't talk about these strings being chosen by an attacker. – tialaramex Jul 26 '09 at 12:33
  • A problem no-one has really addressed is that there are limits on filename length and on total length of a file path, plus arbitrary limits on file names on some platforms, and even a limit on how many files can be in a particular directory. And this is a Java question, so we can't be sure the software will only run on (fill in the name of your favourite OS here). Thus I think any adequate solution would want to consider how to retry or what else to do if the name tried is rejected by the OS. – tialaramex Jul 26 '09 at 12:36
  • @tialaramax: re collisions. Suppose that the user simply wants to store two distinct files using names that happen to collide. Result: one overwrites the other. Re filename limits: the simple answer is to report "name too long" or "too many files" to the user. There clearly have to be limits somewhere. Differences in file name syntax could be handled via a config setting to say which chars are illegal. – Stephen C Jul 26 '09 at 12:56
  • Collisions is why I suggest not using user input for a filename but instead using your own scheme but storing the user's preferred name in a database as a convenience to them. This avoids security and collision issues. – cletus Jul 26 '09 at 16:07
  • 8
    You need to use `"\\W+"` for the regexp in Java. Backslash first applies to the string itself, and `\W` is not a valid escape sequence. I tried to edit the answer, but looks like someone rejected my edit :( – vadipp May 08 '13 at 09:29
  • 1
    How can we exclude characters from the regex above? i.e. spaces, which are safe for filenames. – alianos- Jan 23 '14 at 10:59
  • Side question: is `\\W+` necessary when using `replaceAll`? I would have naturally gone with `\\W`. – Duncan Jones Aug 20 '14 at 13:51
  • What's about numbers? You just dropped numbers from file name. And letters from another languages too. – Evgen Bodunov Dec 28 '16 at 08:47
  • 1
    What if the string is nothing but characters which get dropped (e.g., "_")? This would lead to an empty filename. – twm Jun 12 '17 at 17:36
36

It depends on whether the encoding should be reversible or not.

Reversible

Use URL encoding (java.net.URLEncoder) to replace special characters with %xx. Note that you must handle the special cases where the string equals ., equals .. or is empty!¹ Many programs use URL encoding to create file names, so this is a standard technique which everybody understands.

Irreversible

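Use a hash (e.g. SHA-1, or preferably SHA-256) of the given string. For input that is not chosen by an attacker, modern hash algorithms can be considered collision-free in practice, because an accidental collision is astronomically unlikely. Deliberately constructed collisions have been demonstrated for MD5 and, since 2017, for SHA-1, so prefer a newer algorithm if the strings may be hostile.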
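A minimal sketch of the irreversible variant, using only the JDK. The class name HashedFilename and the choice of SHA-256 with lowercase hex output are assumptions made for this example; the answer itself only says "a hash".

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class HashedFilename {
    /** Returns the SHA-256 digest of s as 64 lowercase hex characters. */
    static String of(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to 0..255 so every byte becomes exactly two hex digits.
                hex.append(String.format("%02x", b & 0xFF));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Java platforms are required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }
}
```

The result contains only the characters 0-9 and a-f and has a fixed length, so it avoids both illegal characters and length limits, at the cost of not resembling the input.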


¹ You can handle all 3 special cases elegantly by using a prefix such as "myApp-". If you put the file directly into $HOME, you'll have to do that anyway to avoid conflicts with existing files such as ".bashrc".
public static String encodeFilename(String s)
{
    try
    {
        return "myApp-" + java.net.URLEncoder.encode(s, "UTF-8");
    }
    catch (java.io.UnsupportedEncodingException e)
    {
        // Keep the original exception as the cause for easier debugging.
        throw new RuntimeException("UTF-8 is an unknown encoding!?", e);
    }
}
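For completeness, the reverse direction might look like the sketch below. The class name FilenameDecoder and the method decodeFilename are hypothetical; the sketch assumes the file name was produced by encodeFilename above with the "myApp-" prefix.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class FilenameDecoder {
    private static final String PREFIX = "myApp-";

    /** Reverses encodeFilename: strips the prefix, then URL-decodes the rest. */
    static String decodeFilename(String filename) {
        if (!filename.startsWith(PREFIX)) {
            throw new IllegalArgumentException("Not an encoded name: " + filename);
        }
        try {
            return URLDecoder.decode(filename.substring(PREFIX.length()), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is an unknown encoding!?", e);
        }
    }
}
```

Note that URLEncoder writes a space as "+", and URLDecoder turns it back into a space, so the round trip is lossless as long as both sides use the same charset.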
vog
  • 2
    URLEncoder's idea of what is a special character may not be correct. – Stephen C Jul 26 '09 at 10:53
  • @Stephen C: according to the documentation (see URLEncoder link), the function generates strings which contain at most the following 67 characters: a-z, A-Z, 0-9, ".", "-", "*", "_" and "+". Each of them is allowed in file names. (yes, "*" is allowed!) – vog Jul 26 '09 at 11:08
  • 4
    @vog: URLEncoder fails for "." and "..". These must be encoded or else you will collide with directory entries in $HOME – Stephen C Jul 26 '09 at 11:12
  • @vog: Just for completeness, there is a third case - "possibly reversible" - which can be implemented with a computationally cheap hash, by removing 'bad' characters or by a variety of other means. – Stephen C Jul 26 '09 at 11:33
  • 6
    @vog: "*" is only allowed in most Unix-based filesystems, NTFS and FAT32 do not support it. – Jonathan Aug 17 '09 at 18:26
  • 1
    "." and ".." can be dealt with by escaping dots to %2E when string is only dots (if you want to minimize the escape sequences). '*' can also be replaced by "%2A". – viphe Jan 03 '13 at 18:48
  • 2
    note that any approach that lengthens the file name (by changing single characters to %20 or whatever) will invalidate some file names that are close to the length limit (255 characters for Unix systems) – smcg Aug 12 '14 at 15:43
  • Note that you probably shouldn't put files directly into `$HOME`- users will hate you for doing that. Create a directory and put your files into that. – Jonas Czech Feb 08 '17 at 08:10
31

Here's what I use:

public String sanitizeFilename(String inputName) {
    return inputName.replaceAll("[^a-zA-Z0-9-_\\.]", "_");
}

This uses a regex to replace every character that is not an ASCII letter, digit, hyphen, underscore or dot with an underscore.

This means that something like "How to convert £ to $" will become "How_to_convert___to__". Admittedly, this result is not very user-friendly, but it is safe and the resulting directory /file names are guaranteed to work everywhere. In my case, the result is not shown to the user, and is thus not a problem, but you may want to alter the regex to be more permissive.

Worth noting that another problem I encountered was that I would sometimes get identical names (since it's based on user input), so you should be aware of that, since you can't have multiple directories / files with the same name in a single directory. I just prepended the current time and date, and a short random string to avoid that. (an actual random string, not a hash of the filename, since identical filenames will result in identical hashes)

Also, you may need to truncate or otherwise shorten the resulting string, since it may exceed the 255 character limit some systems have.
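The last two points (a unique prefix and truncation) could be combined roughly like this. It is a sketch under assumptions: the class name UniqueFilename, the timestamp format, the 8-hex-digit random part and the order of the pieces are all choices made for this example, and the character class is written with the hyphen last, which is intended to match the same characters as the regex above.

```java
import java.security.SecureRandom;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

class UniqueFilename {
    private static final int MAX_LENGTH = 255; // the limit mentioned above; some systems differ
    private static final DateTimeFormatter STAMP = DateTimeFormatter.ofPattern("yyyyMMdd-HHmmss");
    private static final SecureRandom RANDOM = new SecureRandom();

    static String sanitizeFilename(String inputName) {
        return inputName.replaceAll("[^a-zA-Z0-9_.-]", "_");
    }

    /** Builds "<timestamp>_<8 random hex digits>_<sanitized name>", cut to MAX_LENGTH. */
    static String uniqueName(String inputName) {
        String random = String.format("%08x", RANDOM.nextInt());
        String name = LocalDateTime.now().format(STAMP) + "_" + random + "_"
                + sanitizeFilename(inputName);
        // Truncate from the end so the unique prefix always survives.
        return name.length() <= MAX_LENGTH ? name : name.substring(0, MAX_LENGTH);
    }
}
```

Putting the timestamp and random part first means truncation only ever removes characters from the user-derived tail.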

Jonas Czech
  • 9
    Another problem is that it is specific to languages that use ASCII characters. For other languages, it would result in filenames consisting of nothing but underscores. – Andy Thomas Nov 15 '17 at 04:54
17

If you want the result to resemble the original string, SHA-1 or any other hashing scheme is not the answer. If collisions must be avoided, then simple replacement or removal of "bad" characters is not the answer either.

Instead you want something like this. (Note: this should be treated as an illustrative example, not something to copy and paste.)

char fileSep = '/'; // ... or do this portably.
char escape = '%'; // ... or some other legal char.
String s = ...
int len = s.length();
StringBuilder sb = new StringBuilder(len);
for (int i = 0; i < len; i++) {
    char ch = s.charAt(i);
    if (ch < ' ' || ch >= 0x7F || ch == fileSep || ... // add other illegal chars
        || (ch == '.' && i == 0) // we don't want to collide with "." or ".."!
        || ch == escape) {
        sb.append(escape);
        if (ch < 0x10) {
            sb.append('0');
        }
        sb.append(Integer.toHexString(ch));
    } else {
        sb.append(ch);
    }
}
File currentFile = new File(System.getProperty("user.home"), sb.toString());
PrintWriter currentWriter = new PrintWriter(currentFile);

This solution gives a reversible encoding (with no collisions) where the encoded strings resemble the original strings in most cases. I'm assuming that you are using 8-bit characters.

URLEncoder works, but it has the disadvantage that it encodes a whole lot of legal file name characters.

If you want a not-guaranteed-to-be-reversible solution, then simply remove the 'bad' characters rather than replacing them with escape sequences.


The reverse of the above encoding should be equally straight-forward to implement.
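One possible shape for that reverse step is sketched below. The class name EscapeDecoder is hypothetical, and the sketch relies on the same assumption as the encoder (8-bit characters), so that every escape character is followed by exactly two hex digits.

```java
class EscapeDecoder {
    /** Reverses the escaping above: each escape char is followed by two hex digits. */
    static String decode(String encoded, char escape) {
        StringBuilder sb = new StringBuilder(encoded.length());
        int i = 0;
        while (i < encoded.length()) {
            char ch = encoded.charAt(i);
            if (ch == escape) {
                if (i + 3 > encoded.length()) {
                    throw new IllegalArgumentException("Truncated escape at index " + i);
                }
                // Parse the two characters after the escape as a base-16 number.
                sb.append((char) Integer.parseInt(encoded.substring(i + 1, i + 3), 16));
                i += 3;
            } else {
                sb.append(ch);
                i++;
            }
        }
        return sb.toString();
    }
}
```

For instance, with '%' as the escape character, an encoded name of "%2ebashrc" maps back to ".bashrc", since the encoder escapes a leading dot.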

Stephen C
13

For those looking for a general solution, these might be common criteria:

  • The filename should resemble the string.
  • The encoding should be reversible where possible.
  • The probability of collisions should be minimized.

To achieve this we can use regex to match illegal characters, percent-encode them, then constrain the length of the encoded string.

private static final Pattern PATTERN = Pattern.compile("[^A-Za-z0-9_\\-]");

private static final int MAX_LENGTH = 127;

public static String escapeStringAsFilename(String in){

    StringBuffer sb = new StringBuffer();

    // Apply the regex.
    Matcher m = PATTERN.matcher(in);

    while (m.find()) {

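        // Percent-encode each UTF-8 byte of the match as two hex digits,
        // so that URLDecoder can reverse the encoding.
        StringBuilder replacement = new StringBuilder();
        for (byte b : m.group().getBytes(java.nio.charset.StandardCharsets.UTF_8)) {
            replacement.append(String.format("%%%02X", b & 0xFF));
        }

        m.appendReplacement(sb, replacement.toString());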
    }
    m.appendTail(sb);

    String encoded = sb.toString();

    // Truncate the string.
    int end = Math.min(encoded.length(),MAX_LENGTH);
    return encoded.substring(0,end);
}

Patterns

The pattern above is based on a conservative subset of allowed characters in the POSIX spec.

If you want to allow the dot character, use:

private static final Pattern PATTERN = Pattern.compile("[^A-Za-z0-9_\\-\\.]");

Just be wary of strings like "." and ".."

If you want to avoid collisions on case insensitive filesystems, you'll need to escape capitals:

private static final Pattern PATTERN = Pattern.compile("[^a-z0-9_\\-]");

Or escape lower case letters:

private static final Pattern PATTERN = Pattern.compile("[^A-Z0-9_\\-]");

Rather than using a whitelist, you may choose to blacklist reserved characters for your specific filesystem. For example, this regex suits FAT32 filesystems:

private static final Pattern PATTERN = Pattern.compile("[%\\.\"\\*/:<>\\?\\\\\\|\\+,\\.;=\\[\\]]");

Length

On Android, 127 characters is the safe limit. Many filesystems allow 255 characters.

If you prefer to retain the tail, rather than the head of your string, use:

// Truncate the string.
int start = Math.max(0,encoded.length()-MAX_LENGTH);
return encoded.substring(start,encoded.length());

Decoding

To convert the filename back to the original string, use:

URLDecoder.decode(filename, "UTF-8");

Limitations

Because longer strings are truncated, there is the possibility of a name collision when encoding, or corruption when decoding.

Community
SharkAlley
4

Pick your poison from the options presented by commons-codec, for example:

// sha1Hex returns the digest as a hex String; sha1 returns a raw byte[].
String safeFileName = DigestUtils.sha1Hex(filename);
hd1
4

Try using the following regex which replaces every invalid file name character with a space:

public static String toValidFileName(String input)
{
    return input.replaceAll("[:\\\\/*\"?|<>']", " ");
}
BullyWiiPlaza
4

This is probably not the most efficient way, but it shows how to do it using Java 8 pipelines:

private static String sanitizeFileName(String name) {
    return name
            .chars()
            .mapToObj(i -> (char) i)
            .map(c -> Character.isWhitespace(c) ? '_' : c)
            .filter(c -> Character.isLetterOrDigit(c) || c == '-' || c == '_')
            .map(String::valueOf)
            .collect(Collectors.joining());
}

The solution could be improved by creating a custom collector that uses a StringBuilder, so you do not have to convert each light-weight character to a heavy-weight string.
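One way to get that effect without writing a full Collector is to stay on an IntStream and collect straight into a StringBuilder. This is a sketch rather than the answer's own code: the class name StreamSanitizer is hypothetical, and it iterates over code points instead of chars, which is a small deviation from the pipeline above.

```java
class StreamSanitizer {
    static String sanitizeFileName(String name) {
        return name.codePoints()
                // Turn whitespace into underscores before filtering.
                .map(cp -> Character.isWhitespace(cp) ? '_' : cp)
                // Keep letters, digits, hyphens and underscores; drop everything else.
                .filter(cp -> Character.isLetterOrDigit(cp) || cp == '-' || cp == '_')
                // Append each surviving code point to one shared StringBuilder.
                .collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append)
                .toString();
    }
}
```

With this version, "my file: v1.0 (final)" becomes "my_file_v10_final", and no intermediate String is created per character.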

voho
2

If your system stores files in a case sensitive filesystem (where it is possible to store a.txt and A.txt in the same directory), then you could use Base64 in the variant "base64url". It is "URL- and filename-safe" according to https://en.wikipedia.org/wiki/Base64#Variants_summary_table because it uses "-" and "_" instead of "+" and "/".

Apache commons-codec implements this: https://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/binary/Base64.html#encodeBase64URLSafeString-byte:A-

If your filename / directory name is too long then split it into multiple directories: [first 128 characters]/[second 128 characters]/...

As there is no dot in the Base64 charset you don't have to care about special filenames like . or .. or about a final dot at the end of the filename. Also you don't have to care about trailing spaces, ...

If there are reserved words/filenames in your filesystem (or your operating system), like LPT4 in Windows, and the result of Base64url-encoding equals one of them, you could mask it with e.g. an @ character (@LPT4) and remove the masking @ character before decoding. Reserved words are listed here: https://en.wikipedia.org/wiki/Filename#Reserved_characters_and_words

In a Linux system this could work forwards and backwards without loss of data/characters, I guess. Windows will reject having two files named e.g. "abcd" and "ABCD".
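As a rough illustration, the same "base64url" variant is also available in the JDK since Java 8 via java.util.Base64, so the sketch below uses that instead of commons-codec. The class name Base64Filename and the choice to omit the "=" padding are assumptions of this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class Base64Filename {
    /** Encodes s with the URL- and filename-safe Base64 alphabet, without padding. */
    static String encode(String s) {
        return Base64.getUrlEncoder().withoutPadding()
                .encodeToString(s.getBytes(StandardCharsets.UTF_8));
    }

    /** Reverses encode; the JDK decoder does not require the padding characters. */
    static String decode(String filename) {
        return new String(Base64.getUrlDecoder().decode(filename), StandardCharsets.UTF_8);
    }
}
```

Here encode("a/b") yields "YS9i". Splitting long results into directories and masking reserved names, as described above, are left out of this sketch.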

1

If you don't care about reversibility, but want to have nice names in most circumstances that are cross platform compatible, here is my approach.

//: and ? into .
name = name.replaceAll("[\\?:]", ".");

//" into '
name = name.replaceAll("[\"]", "'");

//\, / and | into ,
name = name.replaceAll("[\\\\/|]", ",");

//<, > and * into _
name = name.replaceAll("[<>*]", "_");
return name;

This turns:

This is a **Special** "Test": A\B/C is <BETTER> than D|E|F! Or?

Into:

This is a __Special__ 'Test'. A,B,C is _BETTER_ than D,E,F! Or.
Torge
0

Convert your String to hexadecimal (e.g. with this https://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/binary/Hex.html#encodeHexString-byte:A- ). This works forwards and backwards ( https://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/binary/Hex.html#decodeHex-char:A- ).

Split the resulting String into chunks of 128 characters with one (sub)directory for every chunk.

Even in case-insensitive filesystems / operating systems there is no collision (like it could be in Base64).

At the moment I don't know of any reserved filename (like COM, LPT1, ...) that would collide with a hex value, so I guess that there is no need for masking. If masking were needed, put e.g. an @ in front of the filename and remove it when decoding the filename into the original String.
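A JDK-only sketch of this approach, without commons-codec, might look as follows. The class name HexPath, the use of "/" as the directory separator and the use of UTF-8 for the byte conversion are assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class HexPath {
    private static final int CHUNK = 128; // chunk size suggested above

    /** Hex-encodes the UTF-8 bytes of s and joins 128-character chunks with '/'. */
    static String encode(String s) {
        StringBuilder hex = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            hex.append(String.format("%02x", b & 0xFF));
        }
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < hex.length(); i += CHUNK) {
            parts.add(hex.substring(i, Math.min(hex.length(), i + CHUNK)));
        }
        return String.join("/", parts);
    }

    /** Reverses encode: removes the separators and converts each hex pair to a byte. */
    static String decode(String path) {
        String hex = path.replace("/", "");
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

Here encode("a/b") gives "612f62". An empty input produces an empty path, so that case would need separate handling.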