All our text-based files are encoded in UTF-8 or latin-1 (Windows). The only "special characters" we use are the German umlauts ä, ö, ü and the ß.

For different reasons (including historical, but also the old problem of "properties files cannot be UTF-8"), we cannot unify our encoding completely.

This obviously leads to errors when people read a text file in Java and use the wrong encoding.

Is there an easy, reliable way to detect whether a file is UTF-8 or latin-1, given that the only possible special characters are the ones listed above?

Or do I need to read the file as a byte array and search for special bytes?

J Fabian Meier
  • because you want to check just for some special characters, maybe this can help you `myText.matches(".*[äöüß].*")` – Youcef LAIDANI Aug 11 '17 at 14:00
  • would the utf-8 encoded files have a [byte order mark](https://stackoverflow.com/questions/2223882/whats-different-between-utf-8-and-utf-8-without-bom)? otherwise there is no way to tell for sure (for example it is likely but not guaranteed that umlauts occur in a short german text so this can never serve as a safe indication). The other way round, there can well be an utf-8 file with no character beyond the classic ascii charset in it, so without a byte order mark how would you tell? – Cee McSharpface Aug 11 '17 at 14:01
  • Unfortunately, there is no meta information in standard text files. That means the file contains no indication of its content or its encoding. The only thing that could give you a clue is the [unicode byte order mark](https://en.wikipedia.org/wiki/Byte_order_mark), but these are rarely used. So you'll have to "guess" the encoding yourself by inspecting the bytes. – f1sh Aug 11 '17 at 14:02
  • via [this answer](https://stackoverflow.com/a/4522251/1132334), [this looks promising](https://github.com/superstrom/chardetsharp) – Cee McSharpface Aug 11 '17 at 14:03
  • You could search for the first byte that is not in the ASCII range and then check whether it is an umlaut or ß in latin-1, or the start of the UTF-8 encoding of such a character. This determines the encoding. – Henry Aug 11 '17 at 14:24

1 Answer

If the only non-ASCII characters are "ä, ö, ü and the ß", then you could use the fact that the first byte of their UTF-8 encoding is always 195 (0xC3, which is -61 as a signed Java `byte`). In ISO 8859-1, byte 195 is the character Ã, which apparently you don't expect to find.
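To check that claim, here is a minimal sketch that prints the byte values of each character in both encodings. The class name `UmlautBytes` and the helper `firstUtf8Byte` are made up for this illustration; the characters are written as Unicode escapes so the result does not depend on the encoding of the source file itself.

```java
import java.nio.charset.StandardCharsets;

public class UmlautBytes {
    // Hypothetical helper: first byte of the UTF-8 encoding, as an unsigned value (0..255).
    static int firstUtf8Byte(String s) {
        return s.getBytes(StandardCharsets.UTF_8)[0] & 0xFF;
    }

    public static void main(String[] args) {
        // ä, ö, ü, ß, Ä, Ö, Ü written as escapes
        char[] chars = {'\u00e4', '\u00f6', '\u00fc', '\u00df', '\u00c4', '\u00d6', '\u00dc'};
        for (char c : chars) {
            String s = String.valueOf(c);
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);        // two bytes each
            byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1); // one byte each
            System.out.printf("U+%04X  UTF-8: %d %d  ISO-8859-1: %d%n",
                    (int) c, utf8[0] & 0xFF, utf8[1] & 0xFF, latin1[0] & 0xFF);
        }
    }
}
```

The upper-case umlauts are included because they share the same UTF-8 lead byte, so the check works for them as well.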

So a solution could be something like this:

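    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public static String readFile(Path p) throws IOException {
      byte[] bytes = Files.readAllBytes(p);
      boolean isUtf8 = false;
      for (byte b : bytes) {
        // 0xC3 (195) is -61 as a signed Java byte: the UTF-8 lead byte of ä, ö, ü, ß.
        if (b == -61) {
          isUtf8 = true;
          break;
        }
      }
      // No 0xC3 found: the file is pure ASCII or latin-1, and ISO_8859_1 decodes both.
      return new String(bytes, isUtf8 ? StandardCharsets.UTF_8 : StandardCharsets.ISO_8859_1);
    }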

This is of course quite fragile: a latin-1 file that does contain Ã would be read as UTF-8, and the check won't work reliably if the file contains other special characters.
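A somewhat less fragile variant, sketched here as an assumption rather than as part of the original answer, is to attempt a strict UTF-8 decode and fall back to ISO 8859-1 only when the bytes are not valid UTF-8. It relies on the fact that a latin-1 umlaut byte followed by an ASCII character (or by another umlaut) is never a valid UTF-8 sequence. The class name `StrictDecode` and its methods are hypothetical.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class StrictDecode {
    // Hypothetical helper: decode as UTF-8 if the bytes are valid UTF-8,
    // otherwise treat them as ISO 8859-1 (which accepts every byte sequence).
    static String decode(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // throw instead of replacing
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8, so assume latin-1.
            return new String(bytes, StandardCharsets.ISO_8859_1);
        }
    }

    static String readFile(Path p) throws IOException {
        return decode(Files.readAllBytes(p));
    }
}
```

Pure ASCII files decode identically either way. The sketch can still guess wrong for a latin-1 file whose bytes happen to form valid UTF-8 (for example Ã followed by ¤), so it remains a heuristic.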

assylias