6

Because the constructor of java.io.File takes a java.lang.String as argument, there is seemingly no possibility to tell it which filename encoding to expect when accessing the filesystem layer. So when you generally use UTF-8 as filename encoding and there is some filename containing an umlaut encoded as ISO-8859-1, you are basically **. Is this correct?

Update: because no one seems to get it, try it yourself: when creating a new file, the environment variable LC_ALL (on Linux) determines the encoding of the filename. It does not matter what you do inside your source code!

If you want to give a correct answer, demonstrate that you can create a file (using regular Java means) with proper ISO-8859-1 encoding while your JVM assumes LC_ALL=en_US.UTF-8. The filename should contain a character like ö, ü, or ä.

BTW: if you put files whose names are encoded in a way that does not match LC_ALL into Maven's resource path, Maven will just skip them.

Update II.

Fix this: https://github.com/jjYBdx4IL/filenameenc

i.e. make the f.exists() statement return true.

Update III.

The solution is to use java.nio.*; in my case I had to replace File.listFiles() with Files.newDirectoryStream(). I have updated the example on GitHub. BTW: Maven still seems to use the old java.io API: mvn clean fails.
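As a rough illustration of the replacement described in Update III, here is a minimal sketch that lists a directory with Files.newDirectoryStream() and keeps the entries the JVM can actually find. The class name ListDir and the helper existingEntries are made up for this example and are not taken from the GitHub repository; whether oddly encoded names survive the round trip depends on the platform and JDK.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class ListDir {

    // Hypothetical helper: returns the entries of dir for which Files.exists() is true.
    public static List<Path> existingEntries(Path dir) throws IOException {
        List<Path> found = new ArrayList<>();
        // try-with-resources closes the directory stream when iteration is done
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path entry : stream) {
                if (Files.exists(entry)) {
                    found.add(entry);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        // Directory to list: first argument, or the current directory by default
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path p : existingEntries(dir)) {
            System.out.println(p);
        }
    }
}
```

Running it against the directory that holds the badly encoded file, with LC_ALL=en_US.UTF-8, is one way to compare its behaviour with File.listFiles().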

user1050755
  • 1
    `file.encoding` determines the default charset to use when _reading text files_. It has nothing to do with file names. – fge Apr 01 '14 at 02:37
  • Also, if you use Java 7+, you should really use java.nio.file – fge Apr 01 '14 at 02:38
  • Then check out my test case on GitHub. That's definitely wrong. And to your second suggestion: do you really want one to use JDK 7 just to delete files with bad names? – user1050755 Apr 02 '14 at 00:18
  • 1
    You would want to use JDK7 for many other reasons, like JDK6 no longer being officially supported. – Karol S Oct 27 '14 at 16:02

5 Answers

5

The solution is to use the new API and file.encoding. Demonstration:

fge@alustriel:~/tmp/filenameenc$ echo $LC_ALL
en_US.UTF-8
fge@alustriel:~/tmp/filenameenc$ cat Test.java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Test
{

    public static void main(String[] args)
    {
        final String testString = "a/üöä";
        final Path path = Paths.get(testString);
        final File file = new File(testString);
        System.out.println("Files.exists(): " + Files.exists(path));
        System.out.println("File exists: " + file.exists());
    }
}
fge@alustriel:~/tmp/filenameenc$ install -D /dev/null a/üöä 
fge@alustriel:~/tmp/filenameenc$ java Test
Files.exists(): true
File exists: true
fge@alustriel:~/tmp/filenameenc$ java -Dfile.encoding=iso-8859-1 Test
Files.exists(): false
File exists: true
fge@alustriel:~/tmp/filenameenc$ 

One less reason to use File!

fge
  • Well it already proves one thing: with the new API, you cannot create a _path_ when its string representation _cannot be encoded_. And your filename cannot. – fge Apr 02 '14 at 06:56
  • You can. You just have to switch your locale settings between different runs of the JVM. See my demonstration on GitHub. – user1050755 Apr 02 '14 at 07:09
  • No you can't; didn't you see the stack trace above? (btw, LC_ALL to anything ISO yields US-ASCII as a charset) – fge Apr 02 '14 at 07:25
  • See my update... I was wrong on file.encoding but right on Path: it DOES the job correctly. – fge Apr 02 '14 at 07:59
  • You avoid reading the badly encoded filename from disk. How am I supposed to access a badly encoded filename when I don't know the wrongly encoded name? – user1050755 Apr 04 '14 at 14:58
  • The solution is indeed to use java.nio.*, in my case you had to replace File.listFiles() with Files.newDirectoryStream(). – user1050755 Apr 04 '14 at 15:19
  • For anyone who encounters filename/path encoding issues in docker environment, this nice post helped me quickly: https://mikemybytes.com/2016/05/16/solving-locale-issues-with-docker-containers/ – Jens Kreidler Oct 28 '21 at 13:18
0

Currently I am sitting at a Windows machine, but assuming you can fetch the file system encoding:

String encoding = System.getProperty("file.encoding");
// Alternatively, on Linux, read the locale (e.g. "en_US.UTF-8") from the environment:
String locale = System.getenv("LC_ALL");

Then you have the means to check whether a filename is valid. Mind: Windows can represent Unicode filenames, and my own Linux of course uses UTF-8.

boolean validEncodingForFileName(String name) {
    try {
        byte[] bytes = name.getBytes(encoding);
        String nameAgain = new String(bytes, encoding);
        return name.equals(nameAgain); // Nothing lost?
    } catch (UnsupportedEncodingException ex) {
        return false; // Maybe true, more a JRE limitation.
    }
}
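
A variation on the round-trip check above, sketched with CharsetEncoder.canEncode() instead of encoding and decoding by hand. The class name FileNameEncodingCheck and the method canEncode are invented for this example, and the charset has to be passed in explicitly because how to discover the file system's real encoding is exactly the open question here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class FileNameEncodingCheck {

    // Hypothetical helper: true if every character of name can be encoded in charset.
    public static boolean canEncode(String name, Charset charset) {
        // A fresh encoder is created per call because encoders are not thread-safe
        return charset.newEncoder().canEncode(name);
    }

    public static void main(String[] args) {
        String name = "\u00fc\u00f6\u00e4"; // the characters ü, ö, ä
        System.out.println("ISO-8859-1: " + canEncode(name, StandardCharsets.ISO_8859_1));
        System.out.println("US-ASCII:   " + canEncode(name, StandardCharsets.US_ASCII));
        System.out.println("UTF-8:      " + canEncode(name, StandardCharsets.UTF_8));
    }
}
```

This only says whether a Java string can be represented in a given charset; it says nothing about which bytes are already stored on disk.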

You might try whether File is clever enough (I cannot test it):

boolean validEncodingForFileName(String name) throws IOException {
    // getCanonicalPath() can throw IOException, so it must be declared or caught.
    return new File(name).getCanonicalPath().endsWith(name);
}
Joop Eggen
0

How I fixed java.io.File (on Solaris 5.11):

  • set the LC_* environment variable(s) in the shell/globally.

    e.g. java -DLC_ALL="en_US.ISO8859-1" does not work!

  • make sure the set locale is installed on the system

Why does that fix it?

Java internally calls nl_langinfo() to find out the encoding of paths on disk. That call only sees the process environment; -DVARNAME sets a Java system property, not an environment variable, so it goes unnoticed.

Secondly, this falls back to C/ASCII if the locale set by e.g. LC_ALL is not installed.
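
To see the difference between the two ways of passing LC_ALL, a small diagnostic along these lines may help. It is only a sketch: the class name ShowEncodings is made up, and sun.jnu.encoding is an undocumented, implementation-specific property that is assumed here to reflect the file name encoding on common JDKs; it may be null elsewhere.

```java
import java.nio.charset.Charset;

public class ShowEncodings {

    // Builds a short report of the locale and encoding settings the JVM sees.
    public static String describe() {
        StringBuilder sb = new StringBuilder();
        // Set in the shell, e.g. LC_ALL=en_US.ISO8859-1 java ShowEncodings
        sb.append("LC_ALL (environment): ").append(System.getenv("LC_ALL")).append('\n');
        // Set via java -DLC_ALL=...; this is a system property, not an environment variable
        sb.append("LC_ALL (system property): ").append(System.getProperty("LC_ALL")).append('\n');
        sb.append("file.encoding: ").append(System.getProperty("file.encoding")).append('\n');
        // Assumption: undocumented property, may be absent on some JVMs
        sb.append("sun.jnu.encoding: ").append(System.getProperty("sun.jnu.encoding")).append('\n');
        sb.append("default charset: ").append(Charset.defaultCharset()).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe());
    }
}
```

Comparing the output of LC_ALL=... java ShowEncodings with that of java -DLC_ALL=... ShowEncodings shows which of the two actually reaches the environment.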

-3

String can represent any encoding:

new File("the file name with \u00d6")

or

new File("the file name with Ö")

Julio
  • 2
    No. A string has no representation (like UTF-8 etc) at all by itself. It may have an internal one, but that is of no concern to you as a programmer. – user1050755 Apr 01 '14 at 15:08
-4

You can set the encoding when reading and writing a file. For example, when you write to a file you can pass the encoding to your output stream writer as follows: new OutputStreamWriter(new FileOutputStream(fileName), "UTF-8").

When you read a file, you can specify the charset used for decoding via the following constructor: InputStreamReader(InputStream in, CharsetDecoder dec).
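
A minimal sketch of what that looks like in practice, using the Charset overloads of the same two classes. The class name ExplicitCharsetIo and its helpers are invented for this example. Note that this controls the encoding of the file's content only, not the encoding of the file name.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetIo {

    // Writes text to f as UTF-8, regardless of the platform default charset.
    public static void write(File f, String text) throws IOException {
        try (Writer w = new OutputStreamWriter(new FileOutputStream(f), StandardCharsets.UTF_8)) {
            w.write(text);
        }
    }

    // Reads the whole file back, decoding its bytes as UTF-8.
    public static String read(File f) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new FileInputStream(f), StandardCharsets.UTF_8)) {
            char[] buf = new char[1024];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("charset-demo", ".txt");
        tmp.deleteOnExit();
        String text = "\u00fc\u00f6\u00e4"; // the characters ü, ö, ä
        write(tmp, text);
        System.out.println("round trip ok: " + text.equals(read(tmp)));
    }
}
```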

Niroshan Abayakoon