64

I'm making a cross-platform application that renames files based on data retrieved online. I'd like to sanitize the Strings I took from a web API for the current platform.

I know that different platforms have different file-name requirements, so I was wondering if there's a cross-platform way to do this?

Edit: On Windows platforms you cannot have a question mark '?' in a file name, whereas in Linux, you can. The file names may contain such characters and I would like for the platforms that support those characters to keep them, but otherwise, strip them out.

Also, I would prefer a standard Java solution that doesn't require third-party libraries.

Ben S

8 Answers

33

As suggested elsewhere, this is not usually what you want to do. It is usually best to create a temporary file using a secure method such as File.createTempFile().

You should not do this with a whitelist and only keep 'good' characters. If the file is made up of only Chinese characters then you will strip everything out of it. We can't use an include list for this reason, we have to use an exclude list.

Linux allows pretty much anything, which can be a real pain. I would just limit Linux to the same list that you limit Windows to, so you save yourself headaches in the future.

Using this C# snippet on Windows, I produced a list of the characters that are not valid on Windows. The list is longer than you might expect (41 characters), so I wouldn't recommend trying to create your own.

        foreach (char c in new string(Path.GetInvalidFileNameChars()))
        {
            Console.Write((int)c);
            Console.Write(",");
        }

Here is a simple Java class which 'cleans' a file name.

import java.util.Arrays;

public class FileNameCleaner {
    // Character codes that Windows rejects in file names (see the C# snippet above).
    final static int[] illegalChars = {34, 60, 62, 124, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 58, 42, 63, 92, 47};

    static {
        // Arrays.binarySearch only works on a sorted array.
        Arrays.sort(illegalChars);
    }

    // Returns badFileName with every illegal character removed.
    public static String cleanFileName(String badFileName) {
        StringBuilder cleanName = new StringBuilder();
        for (int i = 0; i < badFileName.length(); i++) {
            int c = (int)badFileName.charAt(i);
            // A negative result means c is not in the illegal list.
            if (Arrays.binarySearch(illegalChars, c) < 0) {
                cleanName.append((char)c);
            }
        }
        return cleanName.toString();
    }
}

EDIT: As Stephen suggested, you should probably also verify that these file accesses only occur within the directory you allow.

The following answer has sample code for establishing a custom security context in Java and then executing code in that 'sandbox'.

How do you create a secure JEXL (scripting) sandbox?

Stephan
Sarel Botha
  • 1
    Good java example, but why didn't you include the forward slash (47)? – THelper Jan 16 '12 at 10:27
  • 1
    No idea why it's not in the list. We actually just ran into this problem in production code. I've fixed the answer to include 47. Thanks. – Sarel Botha Jan 17 '12 at 16:14
  • 3
    The illegalChars array has to be sorted for `binarySearch` to work properly. Please add `Arrays.sort(illegalChars)` or change the array to "{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 34, 42, 47, 58, 60, 62, 63, 92, 124}" – Franz Kafka Jul 24 '13 at 11:59
  • Your solution uses `charAt()`... Basically you should never use `charAt`. Consider it as deprecated. Reason is that `charAt` cannot deal with Unicode code points outside of the [Basic Multilingual Plane](http://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane) as it's a 16-bit value. Instead, use [codePointAt()](http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#codePointAt(int)) which returns an integer. In addition this removes the need for the cast to int that you are currently doing. – Stijn de Witt Oct 17 '14 at 08:03
  • Keep in mind that `length()` returns the number of chars so if you use `codePointAt` you need to use [codePointCount()](http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#codePointCount(int,%20int)): `badFileName.codePointCount(0, badFileName.length());` – Stijn de Witt Oct 17 '14 at 08:08
  • Mmm you are also appending wrong... I'll post updated code with correct Unicode handling in a separate answer. – Stijn de Witt Oct 17 '14 at 08:12
27

Or just do this:

String filename = "A20/B22b#öA\\BC#Ä$%ld_ma.la.xps";
String sane = filename.replaceAll("[^a-zA-Z0-9\\._]+", "_");

Result: A20_B22b_A_BC_ld_ma.la.xps

Explanation:

[a-zA-Z0-9\\._] matches a letter from a-z lower or uppercase, numbers, dots and underscores

[^a-zA-Z0-9\\._] is the inverse. i.e. all characters which do not match the first expression

[^a-zA-Z0-9\\._]+ is a sequence of characters which do not match the first expression

So every sequence of characters which does not consist of characters from a-z, 0-9 or . _ will be replaced.
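If the allow-list approach is kept but non-Latin names must survive, the character class can be widened with Unicode properties. The following is only a sketch: the class name `UnicodeFileNameWhitelist` and the choice to also keep hyphens and combining marks are assumptions, not part of the answer above.

```java
// Hypothetical helper: an allow-list that keeps letters and digits from any script.
final class UnicodeFileNameWhitelist {

    // \p{L} = letters, \p{M} = combining marks, \p{N} = numbers, in any script.
    // Every run of other characters is collapsed into a single underscore.
    static String sanitize(String name) {
        return name.replaceAll("[^\\p{L}\\p{M}\\p{N}._-]+", "_");
    }
}
```

With the sample input from above, this keeps ö and Ä and yields A20_B22b_öA_BC_Ä_ld_ma.la.xps.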

D-rk
  • 14
    This works on a file name that uses only English letters. If the file is made up of only Chinese characters then you will strip everything out of it. We can't use whitelists on strings to strip bad characters for this reason, we have to use blacklists. – Sarel Botha Jan 03 '14 at 19:15
  • Have a look here: http://stackoverflow.com/questions/9576384/use-regular-expression-to-match-any-chinese-character-in-utf-8-encoding it should work if you use Java 7 – D-rk Jan 04 '14 at 09:45
  • @Dirk Downvoted because regex is not the solution here. What if the filenames are in multiple languages? – Franz Kafka Oct 20 '17 at 14:25
  • 1
    It depends on the actual requirements. If whitelisting characters is sufficient, this solution is much more readable. – D-rk Oct 21 '17 at 07:15
  • 4
    To preserve non-latin characters in the filename, you can use the unicode flag (since Java 1.7) as follows: `String sane = filename.replaceAll("(?U)[^\\w\\._]+", "_") ;` – Arie Feb 22 '19 at 11:00
18

This is based on the accepted answer by Sarel Botha which works fine as long as you don't encounter any characters outside of the Basic Multilingual Plane. If you need full Unicode support (and who doesn't?) use this code instead which is Unicode safe:

import java.util.Arrays;

public class FileNameCleaner {
  final static int[] illegalChars = {34, 60, 62, 124, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 58, 42, 63, 92, 47};

  static {
    // Arrays.binarySearch only works on a sorted array.
    Arrays.sort(illegalChars);
  }

  public static String cleanFileName(String badFileName) {
    StringBuilder cleanName = new StringBuilder();
    // i is a char index; it advances by 1 or 2 depending on the code point read.
    for (int i = 0; i < badFileName.length(); ) {
      int c = badFileName.codePointAt(i);
      if (Arrays.binarySearch(illegalChars, c) < 0) {
        cleanName.appendCodePoint(c);
      }
      i += Character.charCount(c);
    }
    return cleanName.toString();
  }
}

Key changes here:

  • Use codePointAt instead of charAt
  • Advance the index by Character.charCount(c) instead of by 1, because codePointAt takes a char index and a code point outside the BMP occupies two chars
  • Use appendCodePoint instead of append
  • No need to cast chars to ints. In fact, you should avoid dealing with single chars here, as a char cannot represent anything outside the BMP.
Community
Stijn de Witt
  • You can use standard functions and work with chars - you just need to skip character that follows surrogate pair character. Also chars don't ever need to be casted to numeric types - they are numeric by design. – weaknespase Dec 29 '14 at 20:51
  • 2
    I have read both the top answer and this one, and this one appears to be more carefully considered...however I cannot find any case where this code performs correctly and the other one doesn't. What input demonstrates the difference? – Doddie Sep 04 '19 at 14:27
  • This code and the top rated answer both fail with characters outside the 16bit range. The correct way to iterate is described here: https://stackoverflow.com/a/361345/278329. Example for error `"abcdef"` – x4rf41 Mar 20 '23 at 04:46
9

Here is the code I use:

public static String sanitizeName( String name ) {
    if( null == name ) {
        return "";
    }

    if( SystemUtils.IS_OS_LINUX ) {
        return name.replaceAll( "[\u0000/]+", "" ).trim();
    }

    return name.replaceAll( "[\u0000-\u001f<>:\"/\\\\|?*\u007f]+", "" ).trim();
}

SystemUtils is from Apache commons-lang3
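Since the question asks for a solution without third-party libraries, the platform check can be approximated with the standard `os.name` system property. This is a sketch under assumptions: the class name `PlatformFileNames` and the extra `strict` parameter are made up for illustration, and the two patterns simply mirror the ones above.

```java
import java.util.Locale;

// Hypothetical JDK-only variant of sanitizeName (no commons-lang3).
final class PlatformFileNames {

    // Rough stand-in for SystemUtils.IS_OS_LINUX, based on the os.name property.
    static boolean isLinux() {
        return System.getProperty("os.name", "").toLowerCase(Locale.ROOT).startsWith("linux");
    }

    // strict = true applies the Windows-style list; false only strips NUL and '/'.
    static String sanitizeName(String name, boolean strict) {
        if (name == null) {
            return "";
        }
        String pattern = strict
                ? "[\\x00-\\x1f<>:\"/\\\\|?*\\x7f]+"
                : "[\\x00/]+";
        return name.replaceAll(pattern, "").trim();
    }

    // Chooses the rule set for the platform the JVM is running on.
    static String sanitizeName(String name) {
        return sanitizeName(name, !isLinux());
    }
}
```

Passing the rule set explicitly also makes it possible to exercise both branches on a single machine.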

Aaron Digulla
6

There's a pretty good built-in Java solution - Character.isXxx().

Try Character.isJavaIdentifierPart(c):

String name = "name.é+!@#$%^&*(){}][/=?+-_\\|;:`~!'\",<>";
StringBuilder filename = new StringBuilder();

for (char c : name.toCharArray()) {
  if (c=='.' || Character.isJavaIdentifierPart(c)) {
    filename.append(c);
  }
}

Result is "name.é$_".

David Carboni
  • okay, so it's a conservative way and doesn't meet the original question fully (cross-platform), but worked for me :) – Mark D Feb 13 '13 at 11:11
  • 8
    It does remove hyphen which is valid for filenames (at least in Windows) but it does the job, anyway I think Apache Commons FilenameUtils should incorporate a cross platform way to get this done – Jaime Hablutzel Mar 08 '13 at 21:22
  • also it removes "@" too which is again valid in Windows. – azerafati Mar 31 '14 at 14:30
5

It is not clear from your question, but since you are planning to accept pathnames from a web form (?), you probably ought to block attempts to rename certain things, e.g. "C:\Program Files". This implies that you need to canonicalize the pathnames to eliminate "." and ".." before you make your access checks.

Given that, I wouldn't attempt to remove illegal characters. Instead, I'd use "new File(str).getCanonicalFile()" to produce the canonical paths, next check that they satisfy your sandboxing restrictions, and finally use "File.exists()", "File.isFile()", etc to check that the source and destination are kosher, and are not the same file system object. I'd deal with illegal characters by attempting to do the operations and catching the exceptions.
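The canonicalize-then-check step could look roughly like the sketch below. The class name `SandboxCheck`, the method `resolveInside`, and the idea of a single allowed base directory are assumptions made for illustration; the answer itself does not prescribe them.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical helper: resolves an untrusted name and rejects paths outside baseDir.
final class SandboxCheck {

    static File resolveInside(File baseDir, String untrustedName) throws IOException {
        // Canonical forms have "." and ".." (and symbolic links) resolved.
        File base = baseDir.getCanonicalFile();
        File candidate = new File(base, untrustedName).getCanonicalFile();

        // Path.startsWith compares whole path components, not raw string prefixes.
        if (!candidate.toPath().startsWith(base.toPath())) {
            throw new IOException("Path escapes the allowed directory: " + untrustedName);
        }
        return candidate;
    }
}
```

The exists() and isFile() checks, and the handling of exceptions raised for illegal characters, would then be applied to the returned File.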

Stephen C
1

Paths.get(...) throws a detailed exception with the position of the illegal character.

public static String removeInvalidChars(final String fileName)
{
  try
  {
    Paths.get(fileName);
    return fileName;
  }
  catch (final InvalidPathException e)
  {
    if (e.getInput() != null && e.getInput().length() > 0 && e.getIndex() >= 0)
    {
      final StringBuilder stringBuilder = new StringBuilder(e.getInput());
      stringBuilder.deleteCharAt(e.getIndex());
      return removeInvalidChars(stringBuilder.toString());
    }
    throw e;
  }
}
l.poellabauer
  • 4
    Ouch. Clever, but don't use that if you require a fast solution (try/catch and recursion). Also if you accept user input from the web, do not forget to trim the input; otherwise posting a filename 1Mb long full of invalid chars would stack-overflow your server for sure ;) – Laurent Grégoire May 15 '19 at 12:38
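The recursion concern can be sidestepped with a loop, which keeps the stack depth constant however many characters get removed. This is only a sketch of the same idea: the class name `IterativePathCleaner` is hypothetical, and which characters `Paths.get` rejects (and whether the exception reports an index at all) depends on the platform's file system provider.

```java
import java.nio.file.InvalidPathException;
import java.nio.file.Paths;

// Hypothetical loop-based variant of removeInvalidChars.
final class IterativePathCleaner {

    static String removeInvalidChars(final String fileName) {
        String current = fileName;
        while (true) {
            try {
                Paths.get(current);
                return current; // the platform accepted the name
            } catch (final InvalidPathException e) {
                int index = e.getIndex();
                // An index of -1 means the position is unknown, so give up.
                if (index < 0 || index >= current.length()) {
                    throw e;
                }
                // Drop the reported character and try again.
                current = new StringBuilder(current).deleteCharAt(index).toString();
            }
        }
    }
}
```

Each pass either returns, throws, or shortens the string by one character, so the loop always terminates. It still costs one exception per invalid character, so capping the input length first remains a good idea.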
0

If you want to allow more than just [A-Za-z0-9], check the MS Naming Conventions, and don't forget to filter out "...Characters whose integer representations are in the range from 1 through 31,...", as the example by Aaron Digulla does. Code such as David Carboni's would not be sufficient for these characters.

Excerpt containing the list of reserved characters:

Use any character in the current code page for a name, including Unicode characters and characters in the extended character set (128–255), except for the following:

The following reserved characters:

  • < (less than)
  • > (greater than)
  • : (colon)
  • " (double quote)
  • / (forward slash)
  • \ (backslash)
  • | (vertical bar or pipe)
  • ? (question mark)
  • * (asterisk)
  • Integer value zero, sometimes referred to as the ASCII NUL character.
  • Characters whose integer representations are in the range from 1 through 31, except for alternate data streams where these characters are allowed. For more information about file streams, see File Streams.
  • Any other character that the target file system does not allow.
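The explicit part of that list can be written as a single character class. This is a minimal sketch, assuming the characters should simply be dropped; the class name `WindowsReservedChars` is made up, and the last bullet (characters the target file system does not allow) is not covered.

```java
import java.util.regex.Pattern;

// Hypothetical helper covering only the reserved characters quoted above.
final class WindowsReservedChars {

    // < > : " / \ | ? * plus the control characters 0 through 31.
    private static final Pattern RESERVED =
            Pattern.compile("[<>:\"/\\\\|?*\\x00-\\x1F]");

    static String strip(String name) {
        return RESERVED.matcher(name).replaceAll("");
    }
}
```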
E_net4
wandlang