I have a bunch of files whose names are encoded in cp1251, and another bunch whose names are encoded in UTF-8. I need a way to find both kinds from Java code. I can't rename the files with the Linux convmv tool, because legacy systems also use them.

Is there a way to pass an encoding to the Files or Paths utility methods in Java?

If I use Files.walk now and print the filenames, they are already broken: they look like a run of ???????? and can't be recovered (or I can't find a way to do that).

Code:

Files.list(Paths.get("/data/my_input"))
   .forEach(path1 -> System.out.println(path1.getFileName()));

Will output:

asdasd.txt
download.jpeg
���� ����� � ������� ���������.txt

The real name of the ???... file is: тест файла с русскими символами.txt

The system locale is:

locale
LANG=en_US.UTF-8
LANGUAGE=en_US
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=ru_RU.UTF-8
LC_TIME=ru_RU.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=ru_RU.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=ru_RU.UTF-8
LC_NAME=ru_RU.UTF-8
LC_ADDRESS=ru_RU.UTF-8
LC_TELEPHONE=ru_RU.UTF-8
LC_MEASUREMENT=ru_RU.UTF-8
LC_IDENTIFICATION=ru_RU.UTF-8
LC_ALL=

JVM running with -Dfile.encoding=UTF-8

If I do ls | iconv -f "cp1251" -t "utf8" I see:

asdasd.txt
download.jpeg
тест файла с русскими символами.txt

Plain ls output is the same as the Java output.

UPDATE: The link suggested by @JosefZ also didn't work.

Example:

name=���� ����� � ������� ���������.txt
fffd fffd fffd fffd 0020 fffd fffd fffd fffd fffd 0020 fffd 0020 fffd fffd fffd fffd fffd fffd fffd 0020 fffd fffd fffd fffd fffd fffd fffd fffd fffd 002e 0074 0078 0074 

As we can see, every non-ASCII character came back as fffd (U+FFFD, the Unicode replacement character), so the name is destroyed.

Code:

try (DirectoryStream<Path> dir = Files.newDirectoryStream(Paths.get("/data/my_input/"))) {
    for (Path child : dir) {
        String filename = child.getFileName().toString();

        System.out.println("name=" + filename);
        for (char c : filename.toCharArray()) {
            System.out.printf("%04x ", (int) c);
        }
        System.out.println();
    }
}

My Java version (the linked question suggested this was a JVM bug): java version "1.8.0_201" Java(TM) SE Runtime Environment (build 1.8.0_201-b09) Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode)

UPDATE 2: @skomisa's suggestion didn't work.

Code:

PrintStream ps = new PrintStream(System.out, true, "UTF-8");      
Files.list(Paths.get("/data/my_input/")).forEach(path1 -> ps.println(path1.getFileName()));

Result:

asdasd.txt
download.jpeg
���� ����� � ������� ���������.txt

If I print the bytes of the filename, we can see that path.getFileName() already returns a destroyed name. Code:

Files.list(Paths.get("/data/my_input/")).forEach(path1 -> System.out.println(Arrays.toString(path1.getFileName().toString().getBytes(StandardCharsets.UTF_8))));

Result:

[97, 115, 100, 97, 115, 100, 46, 116, 120, 116]
[100, 111, 119, 110, 108, 111, 97, 100, 46, 106, 112, 101, 103]
[-17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, 32, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, 32, -17, -65, -67, 32, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, 32, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, 46, 116, 120, 116]

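-17, -65, -67 are the bytes 0xEF 0xBF 0xBD, which is the UTF-8 encoding of U+FFFD (the replacement character, usually displayed as ?).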
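To illustrate why the name cannot be recovered from the String, here is a minimal sketch (the class and method names are made up for this example). It assumes the JVM ships the windows-1251 charset and simply decodes cp1251 bytes as UTF-8, which is in effect what happened to the filename. Every byte sequence that is not valid UTF-8 is replaced with U+FFFD, so the original bytes are gone.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ReplacementCharDemo {

    // Lenient decoding: malformed input is silently replaced with U+FFFD.
    static String decodeAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "тест.txt" written with Unicode escapes, then encoded as cp1251
        // (assumption: the windows-1251 charset is available in this JVM).
        byte[] cp1251Name = "\u0442\u0435\u0441\u0442.txt"
                .getBytes(Charset.forName("windows-1251"));

        String broken = decodeAsUtf8(cp1251Name);
        for (char c : broken.toCharArray()) {
            System.out.printf("%04x ", (int) c);
        }
        System.out.println();

        // Encoding U+FFFD back to UTF-8 gives the bytes seen in the question.
        System.out.println(Arrays.toString(
                "\uFFFD".getBytes(StandardCharsets.UTF_8))); // [-17, -65, -67]
    }
}
```

Once the Cyrillic bytes have been turned into U+FFFD, no later re-encoding of the String can bring them back; the original bytes have to be obtained some other way.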

akvyalkov
  • Does this answer your question? [Get filename as UTF-8? (ä,ü,ö ... is always '?')](https://stackoverflow.com/questions/6117624/get-filename-as-utf-8-%c3%a4-%c3%bc-%c3%b6-is-always) – JosefZ Mar 24 '21 at 15:37
  • @JosefZ thank you for the suggestion. I've seen that question and unfortunately, it has no java solutions. – akvyalkov Mar 24 '21 at 15:50
  • Did you try `System.console().writer().println(path1.getFileName());` instead of `System.out.println()`? What is the output from `ls` (without piping to `iconv`)? – JosefZ Mar 24 '21 at 16:01
  • @JosefZ no, I haven't tried your suggestion yet. Pure `ls` output is same as the java output. – akvyalkov Mar 24 '21 at 16:10
  • Weird, Maybe [java read write unicode / UTF-8 filenames](https://stackoverflow.com/questions/14171565/)? – JosefZ Mar 24 '21 at 16:24
  • Is your question confined to distinguishing between just two encodings (i.e. cp1251 and utf8), or are you asking about the general case of how to determine the encoding of any arbitrary filename using Java? Please update your question to clarify that. – skomisa Mar 25 '21 at 05:48
  • What happens if you tweak your `println()` call to write to a UTF8 `PrintStream`, like this: `PrintStream ps = new PrintStream(System.out, true, "UTF-8"); Files.list(Paths.get("/data/my_input")).forEach(path1 -> ps.println(path1.getFileName()));` Does that cause all the filenames to print correctly? – skomisa Mar 25 '21 at 07:17
  • @skomisa I know what encoding is used for filenames (it's cp1251). I need a way to find them from java, where encoding is set to utf8. – akvyalkov Mar 25 '21 at 14:58

1 Answer

As I found out, sun.nio.fs.UnixPath has a byte[] path field that holds the original filename bytes unchanged. If I take those bytes and decode them as cp1251, I get the proper name with Cyrillic characters: тест файла с русскими символами.txt

Sadly, there is no proper way to access this field. So I looked at the available methods of the Path class and found toUri, which builds its value from the path field.

Here is a solution that somewhat works:
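To make the toUri idea concrete: as far as I can tell on Linux, the raw path of the URI carries every non-ASCII filename byte as a %XX escape, so the original bytes can be rebuilt from it. The sketch below is only an illustration; the class name, the method name, and the sample path are hypothetical, and it assumes the input is an ASCII string such as the one returned by `path.toUri().getRawPath()`.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

public class RawUriPath {

    // Turns a percent-encoded ASCII path back into raw bytes.
    // Throws NumberFormatException if a '%' is followed by non-hex digits.
    static byte[] percentDecode(String rawPath) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(rawPath.length());
        for (int i = 0; i < rawPath.length(); i++) {
            char c = rawPath.charAt(i);
            if (c == '%' && i + 2 < rawPath.length()) {
                // Two hex digits after '%' form one original byte.
                out.write(Integer.parseInt(rawPath.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                // Everything else (including '+') is copied through unchanged.
                out.write(c);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Hypothetical raw path, shaped like a toUri().getRawPath() result.
        String rawPath = "/data/my_input/%F2%E5%F1%F2.txt";
        byte[] raw = percentDecode(rawPath);
        System.out.println(new String(raw, Charset.forName("windows-1251")));
    }
}
```

Working on bytes instead of going through URLDecoder means the "+" character needs no special handling, because nothing here treats it as an encoded space.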

Path tryToFindWithCp1251Encoding(Path directory, String filePathToSearch) throws IOException {
    try (Stream<Path> paths = Files.walk(directory)) {
        for (Iterator<Path> it = paths.iterator(); it.hasNext(); ) {
            Path path = it.next();

            // Using getRawPath method to exclude Uri prefix like "file:///"
            String uriString = path.toUri().getRawPath();

            // The "+" sign is a special character when decoding url-encoded strings
            // (URLDecoder turns it into a space), so we replace it by hand with "%2B".
            // See https://stackoverflow.com/a/6926987/2530910 (also look at the comments)
            uriString = uriString.replace("+", "%2B");

            String decodedFilePathFromCp1251 = URLDecoder.decode(uriString, "Cp1251");

            if (decodedFilePathFromCp1251.equals(filePathToSearch)) {
                return path;
            }
        }
        return null;
    }
}

It's a rather hacky solution; I'd prefer a cleaner, proper way to do this without the intermediate URI conversion. But at least it works.
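Since the directory holds both UTF-8 and cp1251 names, one possible heuristic, once the raw name bytes are available, is to try a strict UTF-8 decode first and fall back to cp1251 only when that fails. This is a sketch under assumptions, not a guaranteed detector: the class and method names are invented here, and a cp1251 name whose bytes happen to form valid UTF-8 would be misread (unlikely for ordinary Cyrillic text, but possible).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class NameDecoder {

    // Assumption: the windows-1251 charset is available in this JVM.
    static final Charset CP1251 = Charset.forName("windows-1251");

    // Strict UTF-8 first; any malformed sequence triggers the cp1251 fallback.
    static String decodeName(byte[] raw) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8, so treat the bytes as cp1251.
            return new String(raw, CP1251);
        }
    }

    public static void main(String[] args) {
        String name = "\u0442\u0435\u0441\u0442.txt"; // "тест.txt"
        System.out.println(decodeName(name.getBytes(StandardCharsets.UTF_8)));
        System.out.println(decodeName(name.getBytes(CP1251)));
    }
}
```

The difference from `new String(bytes, UTF_8)` is the `REPORT` action: instead of silently inserting U+FFFD, the decoder throws, which is what makes the fallback possible.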

akvyalkov