4

In Java, how can a file be tested to confirm that its encoding is definitely not utf-8?

I want to be able to validate if the contents are well-formed utf-8.

I also need to validate that the file does not start with a byte order mark (BOM).

yas
  • Possible duplicate of https://stackoverflow.com/questions/499010/java-how-to-determine-the-correct-charset-encoding-of-a-stream – Pushpesh Kumar Rajwanshi Oct 28 '18 at 19:28
  • @PushpeshKumarRajwanshi I'm not trying to determine the encoding. The file is assumed to be encoded in utf-8. The validation is to determine if this is not the case. – yas Oct 28 '18 at 19:48
  • You need to detect whether there are invalid multi-byte sequences by reading the contents of the file as if it were binary. – Peter Lawrey Oct 28 '18 at 19:52
  • Do you need the contents of the file, or do you just need to check it without retaining the contents? – VGR Oct 29 '18 at 14:58
  • @VGR without retaining the contents. For my use case, the contents are already saved as a file on the system. – yas Oct 29 '18 at 16:34

1 Answer

2

If you just need to test the file, without actually retaining its contents:

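import java.io.Reader;
import java.io.Writer;
import java.nio.charset.CharacterCodingException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

Path path = Paths.get("/home/dave/somefile.txt");
try (Reader reader = Files.newBufferedReader(path)) {
    // The first decoded character is U+FEFF if the file starts with a BOM.
    int c = reader.read();
    if (c == 0xfeff) {
        System.out.println("File starts with a byte order mark.");
    } else if (c >= 0) {
        // Decode the rest of the file, discarding the characters.
        reader.transferTo(Writer.nullWriter());
    }
} catch (CharacterCodingException e) {
    // Thrown when the decoder meets a byte sequence that is not valid UTF-8.
    System.out.println("Not a UTF-8 file.");
}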
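  • Files.newBufferedReader always uses UTF-8 if no charset is provided, and its reader throws an exception on malformed input (in practice a MalformedInputException, which is a CharacterCodingException) instead of substituting a replacement character.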
  • 0xfeff is the byte order mark codepoint.
  • reader.transferTo(Writer.nullWriter()) processes the rest of the file and immediately discards the decoded characters. Reader.transferTo is available as of Java 10 and Writer.nullWriter as of Java 11.
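For reuse, the same idea can be wrapped in a method that returns a result instead of printing. The following is only a sketch under a few assumptions: the class name Utf8Check, the Result enum and the method name check are made up for illustration, and Java 11 or later is assumed. It configures the decoder explicitly with CodingErrorAction.REPORT, so the strict behaviour does not depend on the defaults of Files.newBufferedReader.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8Check {
    /** Possible outcomes; the names are illustrative only. */
    public enum Result { VALID, HAS_BOM, NOT_UTF8 }

    /** Decodes the whole file as strict UTF-8 without keeping its contents. */
    public static Result check(Path path) throws IOException {
        // REPORT makes the decoder throw instead of inserting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        boolean bom = false;
        try (Reader reader = new InputStreamReader(Files.newInputStream(path), decoder)) {
            int c = reader.read();
            // Java's UTF-8 decoder does not strip the BOM; it shows up as U+FEFF.
            if (c == 0xFEFF) {
                bom = true;
            }
            if (c >= 0) {
                // Decode the remainder and throw the characters away.
                reader.transferTo(Writer.nullWriter());
            }
        } catch (CharacterCodingException e) {
            return Result.NOT_UTF8;
        }
        return bom ? Result.HAS_BOM : Result.VALID;
    }

    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            System.out.println(arg + ": " + check(Path.of(arg)));
        }
    }
}
```

Note that this sketch keeps decoding after a byte order mark, so a file that has a BOM and also contains malformed bytes is reported as NOT_UTF8 rather than HAS_BOM. Other I/O failures, such as a missing file, still surface as IOException.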
VGR