I have a bunch of files whose names are encoded in cp1251.
I also have a bunch of files with utf8-encoded names.
I need a way to find them both with Java code.
Also, I can't rename the files with the convmv Linux tool, as there are legacy systems that also use these files.
Is there a way to pass an encoding to the Files or Paths utility methods in Java?
If I use Files.walk now and try to print the filenames, they are already broken: they look like a bunch of ???????? and can't be recovered (or I can't find a way to do that).
Code:
Files.list(Paths.get("/data/my_input"))
    .forEach(path1 -> System.out.println(path1.getFileName()));
Will output:
asdasd.txt
download.jpeg
���� ����� � ������� ���������.txt
The real name of the ???... file is: тест файла с русскими символами.txt
The system locale is:
locale
LANG=en_US.UTF-8
LANGUAGE=en_US
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=ru_RU.UTF-8
LC_TIME=ru_RU.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=ru_RU.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=ru_RU.UTF-8
LC_NAME=ru_RU.UTF-8
LC_ADDRESS=ru_RU.UTF-8
LC_TELEPHONE=ru_RU.UTF-8
LC_MEASUREMENT=ru_RU.UTF-8
LC_IDENTIFICATION=ru_RU.UTF-8
LC_ALL=
The JVM is running with -Dfile.encoding=UTF-8
If I do ls | iconv -f "cp1251" -t "utf8"
I see:
asdasd.txt
download.jpeg
тест файла с русскими символами.txt
Plain ls output is the same as the Java output.
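A minimal sketch of what seems to be going on, assuming the name bytes on disk are cp1251 and something along the way decodes them as UTF-8. The class and method names here are made up for illustration and are not part of my code; the four bytes are the cp1251 encoding of "тест".

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class NameDecodingSketch {
    // Assumption: the legacy names are cp1251, which Java knows as "windows-1251".
    static final Charset CP1251 = Charset.forName("windows-1251");

    // Decodes raw name bytes as UTF-8; malformed sequences become U+FFFD.
    static String decodeAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Decodes the same bytes as cp1251, which is what iconv -f cp1251 does.
    static String decodeAsCp1251(byte[] raw) {
        return new String(raw, CP1251);
    }

    public static void main(String[] args) {
        // "тест" encoded in cp1251
        byte[] raw = {(byte) 0xF2, (byte) 0xE5, (byte) 0xF1, (byte) 0xF2};

        String broken = decodeAsUtf8(raw);
        String readable = decodeAsCp1251(raw);

        System.out.println("as cp1251: " + readable);
        System.out.println("as UTF-8 contains U+FFFD: " + (broken.indexOf('\uFFFD') >= 0));

        // Re-encoding the broken string does not give the original bytes back,
        // which is why the name cannot be recovered once it is a String.
        byte[] reencoded = broken.getBytes(StandardCharsets.UTF_8);
        System.out.println("round trip preserved bytes: " + Arrays.equals(raw, reencoded));
    }
}
```

If this is right, the damage happens before my code ever sees the Path, so fixing the output encoding afterwards cannot help.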
UPDATE: The link suggested by @JosefZ also didn't work.
Example:
name=���� ����� � ������� ���������.txt
fffd fffd fffd fffd 0020 fffd fffd fffd fffd fffd 0020 fffd 0020 fffd fffd fffd fffd fffd fffd fffd 0020 fffd fffd fffd fffd fffd fffd fffd fffd fffd 002e 0074 0078 0074
As we can see, every non-ASCII character came back as fffd, so the name is destroyed.
Code:
try (DirectoryStream<Path> dir = Files.newDirectoryStream(Paths.get("/data/my_input/"))) {
    for (Path child : dir) {
        String filename = child.getFileName().toString();
        System.out.println("name=" + filename);
        // Dump each UTF-16 code unit of the name as four hex digits
        for (char c : filename.toCharArray()) {
            System.out.printf("%04x ", (int) c);
        }
        System.out.println();
    }
}
My Java version (the linked answer suggested that this was a JVM bug):
java version "1.8.0_201"
Java(TM) SE Runtime Environment (build 1.8.0_201-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode)
UPDATE 2: @skomisa's suggestion didn't work either.
Code:
PrintStream ps = new PrintStream(System.out, true, "UTF-8");
Files.list(Paths.get("/data/my_input/")).forEach(path1 -> ps.println(path1.getFileName()));
Result:
asdasd.txt
download.jpeg
���� ����� � ������� ���������.txt
If I print out the bytes of the filename, we can see that path.getFileName() already returns a destroyed name.
Code:
Files.list(Paths.get("/data/my_input/")).forEach(path1 -> System.out.println(Arrays.toString(path1.getFileName().toString().getBytes(StandardCharsets.UTF_8))));
Result:
[97, 115, 100, 97, 115, 100, 46, 116, 120, 116]
[100, 111, 119, 110, 108, 111, 97, 100, 46, 106, 112, 101, 103]
[-17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, 32, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, 32, -17, -65, -67, 32, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, 32, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, 46, 116, 120, 116]
-17, -65, -67 (0xEF 0xBF 0xBD) is the UTF-8 encoding of U+FFFD, the replacement character �, I think.
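A quick way to check that byte triple, as a sketch (the class and method names are made up for illustration): encode U+FFFD as UTF-8 and print the signed byte values the same way the code above does.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ReplacementCharBytes {
    // Returns the UTF-8 bytes of a string in Arrays.toString form (signed values).
    static String utf8BytesOf(String s) {
        return Arrays.toString(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // U+FFFD is the Unicode replacement character
        System.out.println(utf8BytesOf("\uFFFD")); // prints [-17, -65, -67]
    }
}
```

So each -17, -65, -67 group in the result is one replacement character, and the original cp1251 byte is no longer present in the String.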