I need to check the encoding of files; if a file is readable (valid UTF-8), the check should return true.
Following this SO answer, I converted that logic into Java code, but it doesn't work. Specifically, this part of the code:
if ((buffer[0] & 0xF8) == 0xF0) {
    if (((buffer[1] & 0xC0) == 0x80)
            && ((buffer[2] == 0x80) && ((buffer[3] == 0x80))))
        return true;
} else if ((buffer[0] & 0xF0) == 0xE0) {
    if (((buffer[1] & 0xC0) == 0x80) && ((buffer[2] & 0xC0) == 0x80))
        return true;
} else if ((buffer[0] & 0xE0) == 0xC0) {
    if (((buffer[1] & 0xC0) == 0x80))
        return true;
}
return false;
This check doesn't work correctly: even when the file being checked is definitely UTF-8, it returns false.
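As far as I understand the linked answer, the masks are supposed to classify the lead byte of each UTF-8 sequence. This is a quick sanity check of the masks in isolation (the byte values below are just my own examples, not taken from the answer), and they seem to behave as described:

public class MaskDemo {
    public static void main(String[] args) {
        byte lead2 = (byte) 0xD0; // 110xxxxx: first byte of a 2-byte sequence
        byte lead3 = (byte) 0xE4; // 1110xxxx: first byte of a 3-byte sequence
        byte lead4 = (byte) 0xF0; // 11110xxx: first byte of a 4-byte sequence
        byte cont  = (byte) 0x90; // 10xxxxxx: continuation byte

        System.out.println((lead2 & 0xE0) == 0xC0); // true
        System.out.println((lead3 & 0xF0) == 0xE0); // true
        System.out.println((lead4 & 0xF8) == 0xF0); // true
        System.out.println((cont  & 0xC0) == 0x80); // true
    }
}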
The complete code:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

class EncodindsCheck implements Checker {

    private static final int UTF8_HEADER_SIZE = 8;

    @Override
    public boolean check(File currentFile) {
        return isUTF8(currentFile);
    }

    public static boolean isUTF8(File file) {
        // validate input
        if (null == file) {
            throw new IllegalArgumentException("input file can't be null");
        }
        if (file.isDirectory()) {
            throw new IllegalArgumentException(
                    "input file refers to a directory");
        }

        // read input file
        byte[] buffer;
        try {
            buffer = readUTFHeaderBytes(file);
        } catch (IOException e) {
            throw new IllegalArgumentException(
                    "Can't read input file, error = " + e.getLocalizedMessage());
        }

        if ((buffer[0] & 0xF8) == 0xF0) {
            if (((buffer[1] & 0xC0) == 0x80)
                    && ((buffer[2] == 0x80) && ((buffer[3] == 0x80))))
                return true;
        } else if ((buffer[0] & 0xF0) == 0xE0) {
            if (((buffer[1] & 0xC0) == 0x80) && ((buffer[2] & 0xC0) == 0x80))
                return true;
        } else if ((buffer[0] & 0xE0) == 0xC0) {
            if (((buffer[1] & 0xC0) == 0x80))
                return true;
        }
        return false;
    }

    private static byte[] readUTFHeaderBytes(File input) throws IOException {
        byte[] buffer = new byte[UTF8_HEADER_SIZE];
        // read data
        FileInputStream fis = new FileInputStream(input);
        fis.read(buffer);
        fis.close();
        return buffer;
    }
}
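This is roughly how I test it (the file name and the sample text are placeholders I made up for this post; the file content is written with an explicit UTF-8 encoding, so it is definitely valid UTF-8):

import java.io.File;
import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;

public class ReproMain {
    public static void main(String[] args) throws Exception {
        // write a file that is unquestionably valid UTF-8
        File f = new File("utf8-sample.txt");
        try (FileOutputStream out = new FileOutputStream(f)) {
            out.write("Hello, мир".getBytes(StandardCharsets.UTF_8));
        }

        // expected: true, actual: false
        System.out.println(EncodindsCheck.isUTF8(f));
    }
}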
Questions:
- Why doesn't this check work?
- How can I implement this check correctly with this approach (validating the bytes as a sequence of UTF-8 characters)?
- How can I check for other charsets (UTF-16, etc.)?