How to detect if a file is not utf-8 encoded?

Question

In Java, how can a file be tested to determine that its encoding is definitely not utf-8?

I want to be able to validate if the contents are well-formed utf-8.

Furthermore, I also need to validate that the file does not start with the byte order mark (BOM).

Possible duplicate of https://stackoverflow.com/questions/499010/java-how-to-determine-the-correct-charset-encoding-of-a-stream — Pushpesh Kumar Rajwanshi, Oct 28 '18 at 19:28
@PushpeshKumarRajwanshi I'm not trying to determine the encoding. The file is assumed to be encoded in utf-8. The validation is to determine if this is not the case. — yas, Oct 28 '18 at 19:48
You need to detect if there are invalid multi character combinations by reading the contents of the file as if it was binary. — Peter Lawrey, Oct 28 '18 at 19:52
Do you need the contents of the file, or do you just need to check it without retaining the contents? — VGR, Oct 29 '18 at 14:58
@VGR without retaining the contents. For my use case, the contents are already saved as a file on the system. — yas, Oct 29 '18 at 16:34

VGR · Answer 1 · 2018-10-30T14:28:21.917

If you just need to test the file, without actually retaining its contents:

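import java.io.Reader;
import java.io.Writer;
import java.nio.charset.CharacterCodingException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

Path path = Paths.get("/home/dave/somefile.txt");
// Files.newBufferedReader decodes as UTF-8 and reports malformed input.
try (Reader reader = Files.newBufferedReader(path)) {
    int c = reader.read();
    if (c == 0xfeff) {
        System.out.println("File starts with a byte order mark.");
    } else if (c >= 0) {
        // Decode the rest of the file, discarding the characters.
        reader.transferTo(Writer.nullWriter());
    }
} catch (CharacterCodingException e) {
    System.out.println("Not a UTF-8 file.");
}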

Files.newBufferedReader always uses UTF-8 if no charset is provided.
0xfeff is the byte order mark codepoint.
reader.transferTo(Writer.nullWriter()) (available as of Java 11) decodes the rest of the file and immediately discards the decoded characters, so any malformed byte sequence still triggers the exception.
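
For anyone who wants the check above as a reusable method, here is a minimal sketch. The class name Utf8FileCheck, the Result enum and the check method are hypothetical names chosen for illustration, not part of any library. It assumes the same behaviour the snippet above relies on (Files.newBufferedReader decoding as UTF-8 and throwing a CharacterCodingException on malformed input), and it drains the reader with a plain loop instead of transferTo and Writer.nullWriter().

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.CharacterCodingException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class; the names are illustrative only.
public class Utf8FileCheck {

    /** Possible outcomes of the check. */
    public enum Result { VALID_UTF8, STARTS_WITH_BOM, NOT_UTF8 }

    /**
     * Decodes the whole file as UTF-8 without retaining its contents.
     * Other I/O failures (missing file, no permission) propagate as IOException.
     */
    public static Result check(Path path) throws IOException {
        try (Reader reader = Files.newBufferedReader(path)) {
            // Java's UTF-8 decoder does not strip the BOM, so it shows up
            // as the first decoded character, U+FEFF.
            boolean bom = reader.read() == 0xFEFF;

            // Drain the rest of the file; a malformed byte sequence makes
            // read() throw a CharacterCodingException.
            char[] buffer = new char[8192];
            while (reader.read(buffer) != -1) {
                // Decoded characters are discarded on purpose.
            }
            return bom ? Result.STARTS_WITH_BOM : Result.VALID_UTF8;
        } catch (CharacterCodingException e) {
            return Result.NOT_UTF8;
        }
    }
}
```

Unlike the snippet above, this sketch keeps decoding after a leading BOM, so a file that starts with a BOM but contains malformed bytes later is reported as NOT_UTF8 rather than STARTS_WITH_BOM. Since it avoids transferTo and Writer.nullWriter(), it should also compile on Java 8.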
