Encoding detection method doesn't work

Question

I need to check the encoding of files and return true if a file is readable.
Following this SO answer, I converted its logic to Java, but it doesn't work. Specifically, this part of the code:

if ((buffer[0] & 0xF8) == 0xF0) {
        if (((buffer[1] & 0xC0) == 0x80)
            && ((buffer[2] == 0x80) && ((buffer[3] == 0x80))))
            return true;
    } else if ((buffer[0] & 0xF0) == 0xE0) {
        if (((buffer[1] & 0xC0) == 0x80) && ((buffer[2] & 0xC0) == 0x80))
            return true;
    } else if ((buffer[0] & 0xE0) == 0xC0) {
        if (((buffer[1] & 0xC0) == 0x80))
            return true;
    } return false;

This check gives the wrong result: even when the file being checked is 100% UTF-8, it returns false.

The full code:

class EncodindsCheck implements Checker {
    private static final int UTF8_HEADER_SIZE = 8;

    @Override
    public boolean check(File currentFile) {
        return isUTF8(currentFile);
    }

    public static boolean isUTF8(File file) {
        // validate input
        if (null == file) {
            throw new IllegalArgumentException("input file can't be null");
        }
        if (file.isDirectory()) {
            throw new IllegalArgumentException(
                    "input file refers to a directory");
        }

        // read input file
        byte[] buffer;
        try {
            buffer = readUTFHeaderBytes(file);
        } catch (IOException e) {
            throw new IllegalArgumentException(
                    "Can't read input file, error = " + e.getLocalizedMessage());
        }

        if ((buffer[0] & 0xF8) == 0xF0) {
            if (((buffer[1] & 0xC0) == 0x80)
                && ((buffer[2] == 0x80) && ((buffer[3] == 0x80))))
                return true;
        } else if ((buffer[0] & 0xF0) == 0xE0) {
            if (((buffer[1] & 0xC0) == 0x80) && ((buffer[2] & 0xC0) == 0x80))
                return true;
        } else if ((buffer[0] & 0xE0) == 0xC0) {
            if (((buffer[1] & 0xC0) == 0x80))
                return true;
        }

        return false;
    }

    private static byte[] readUTFHeaderBytes(File input) throws IOException {
        byte[] buffer = new byte[UTF8_HEADER_SIZE];
        // read data
        FileInputStream fis = new FileInputStream(input);
        fis.read(buffer);
        fis.close();
        return buffer;
    }
}

Questions:

Why doesn't this check work?
How can I fix this detection while keeping this approach (checking the bytes as a sequence of UTF-8 characters)?
How do I check other charsets (UTF-16 etc.)?

Could you provide a sample of the UTF-8 files which are failing? – devrobf Mar 08 '13 at 14:56
Did you read the original SO answer? Does buffer[0] even hold a byte > 0x7f? – Ingo Mar 08 '13 at 15:06
I don't. I just point out that the SO answer you linked to introduced this logic to see if a sequence of bytes that starts with a byte > 0x7f forms a valid UTF-8 code. – Ingo Mar 08 '13 at 15:21

score 2 · Accepted Answer · answered Mar 08 '13 at 16:31

Code points in UTF-8 can be 1, 2, 3 or 4 bytes long.

If all the code points were in the range U+0000 to U+007F then isUTF8 would return false. In this case, the file would be valid for a large number of encodings (UTF-8, ASCII, ANSI-encodings, etc.)

Your UTF-8 check trusts to luck that the first code point is above U+007F.
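
As a rough illustration of the point above, here is a minimal sketch that classifies a lead byte by its bit pattern. The class and method names (`Utf8LeadByte`, `sequenceLength`) are made up for this example and are not part of the question's code. It includes the one-byte case (`0xxxxxxx`), which the posted check never handles, so a file that starts with a plain ASCII character falls through to `return false`.

```java
public class Utf8LeadByte {
    /**
     * Returns the sequence length implied by a UTF-8 lead byte,
     * or -1 if the byte cannot start a sequence.
     * This looks at the bit pattern only; it does not reject
     * lead bytes that are never valid (0xC0, 0xC1, 0xF5..0xF7).
     */
    static int sequenceLength(byte lead) {
        int b = lead & 0xFF;               // view the signed byte as 0..255
        if ((b & 0x80) == 0x00) return 1;  // 0xxxxxxx: U+0000..U+007F
        if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
        return -1;                         // 10xxxxxx or 11111xxx
    }

    public static void main(String[] args) {
        System.out.println(sequenceLength((byte) 'A'));
        System.out.println(sequenceLength((byte) 0xC3));
    }
}
```

Separately, note that `byte` is signed in Java, so an unmasked comparison such as `buffer[2] == 0x80` can never be true. The masked form `(buffer[2] & 0xC0) == 0x80`, as used for `buffer[1]`, is what the encoding scheme calls for.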

I suggest you take a look at a more comprehensive encoding detection API, at least as an example.

Note that fis.read(buffer); is not guaranteed to fill the array; the type contract requires you to inspect the return value for the number of bytes read.
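
One common way to handle this is to call `read` in a loop until the buffer is full or the stream ends. The sketch below is only an illustration: `HeaderReader` and `readUpTo` are hypothetical names, and wrapping the `IOException` in an `UncheckedIOException` is a choice made for this example, not something the question requires.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

public class HeaderReader {
    /**
     * Reads at most max bytes from the stream, looping because a single
     * read() call may return fewer bytes than requested.
     * The returned array is trimmed to the number of bytes actually read.
     */
    static byte[] readUpTo(InputStream in, int max) {
        byte[] buffer = new byte[max];
        int total = 0;
        try {
            while (total < max) {
                int n = in.read(buffer, total, max - total);
                if (n < 0) {
                    break; // end of stream reached before the buffer filled
                }
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Arrays.copyOf(buffer, total);
    }

    public static void main(String[] args) {
        byte[] data = {0x41, 0x42, 0x43};
        InputStream in = new ByteArrayInputStream(data);
        System.out.println(readUpTo(in, 8).length);
    }
}
```

Returning a trimmed array also means the caller can see when a file is shorter than the header size, instead of inspecting zero bytes that were never read.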

McDowell

107,573
31
204
267

How can I fix this problem and check the file in a valid way? Also, may I ask why we compute `(buffer[0] & 0xF8)` and compare the result with `0xF0` in this example? Why is this needed? – catch23 Mar 08 '13 at 18:26
`fis.read(buffer);` - How can I work around this uncertainty? – catch23 Mar 09 '13 at 09:24
I haven't checked your numbers, but the masks should check against [the encoding scheme](http://en.wikipedia.org/wiki/UTF-8#Description). For example, if the 1st byte matches `1110xxxx` then the next two must match `10xxxxxx`. But, you can avoid all of this and use the `Decoder` type to [check for malformed input](http://docs.oracle.com/javase/7/docs/api/java/nio/charset/CharsetDecoder.html#onMalformedInput%28java.nio.charset.CodingErrorAction%29). Not that this will guarantee that the file is UTF-8 - just that the data doesn't violate these rules - there is no way to detect encoding reliably. – McDowell Mar 11 '13 at 09:25
Your question about `fis.read(buffer)` warrants a new question. – McDowell Mar 11 '13 at 09:33
Can we use [jChardet](http://jchardet.sourceforge.net/index.html) for this purpose? – catch23 Mar 13 '13 at 17:27
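
The decoder-based check mentioned in the comments could look roughly like the sketch below. `StrictUtf8` and `isValidUtf8` are hypothetical names; the `CharsetDecoder` and `CodingErrorAction` calls are the standard `java.nio.charset` API. As noted above, a `true` result only means the bytes do not violate the UTF-8 rules, not that the file really is UTF-8. Pure ASCII input also passes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8 {
    /** Returns true if the bytes decode as UTF-8 without any error. */
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently
        // substituting a replacement character for bad input.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] good = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        byte[] bad = {(byte) 0xC3, (byte) 0x28}; // lead byte, then a non-continuation byte
        System.out.println(isValidUtf8(good));
        System.out.println(isValidUtf8(bad));
    }
}
```

This works best on the whole file content. If only the first few bytes are passed in, a multi-byte sequence cut off at the end of the buffer is reported as malformed.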
