1

I have a problem whereby I need to be able to detect whether a byte array contains characters which comply with ISO-8859-1 encoding.

I found the following question useful: "Java : How to determine the correct charset encoding of a stream". However, none of its answers fully answers my question.

I have attempted to use the TikaEncodingDetector, as shown below:

public static Charset guessCharset(final byte[] content) throws IOException {
    final InputStream isx = new ByteArrayInputStream(content);
    return Charset.forName(new TikaEncodingDetector().guessEncoding(isx));
}

Unfortunately, this approach makes different predictions depending on the content of the byte array. For example, an array containing 'h','e','l','l','o' is determined to be ISO-8859-1, 'w','o','r','l','d' comes out as IBM500, and 'a','b','c','d','e' results in UTF-8.

All I want to know is: does my byte array validate against the ISO-8859-1 standard? I would be grateful for suggestions on the best way to carry out this task.

Rob
  • Guessing an encoding can be very tricky, do you have a good reason for it? – Kayaman Sep 11 '15 at 09:45
  • 2
    AFAIK, every byte is a valid ISO-8859-1 character. So any byte array is valid ISO-8859-1. That doesn't mean the string won't contain control characters, though. So, maybe what you should check is that every character of the String is one of the printable characters you expect in the string. – JB Nizet Sep 11 '15 at 09:52
  • Your best bet is to ask the user. Allow the user to select from a list of encodings and provide an instant preview box where they can see what the text looks like with their chosen encoding. – biziclop Sep 11 '15 at 09:57

2 Answers

9

I have a problem whereby I need to be able to detect whether a byte array contains characters which comply with ISO-8859-1 encoding.

Well, every stream of binary data can be viewed as "valid" in ISO-8859-1, as it's simply a single-byte-per-character scheme mapping bytes 0-255 to U+0000 through U+00FF in a trivial way. Compare that with UTF-8 or UTF-16, where certain byte sequences are simply invalid.
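
As a rough illustration of that contrast, here is a minimal sketch using the JDK's `CharsetDecoder` configured to report errors instead of silently replacing them. The class and method names (`StrictDecodeDemo`, `decodesCleanly`) are made up for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

final class StrictDecodeDemo {
    /**
     * Returns true if the bytes decode under the given charset without any
     * malformed or unmappable input. Hypothetical helper for illustration.
     */
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // Thrown for byte sequences the charset cannot represent.
            return false;
        }
    }
}
```

With `StandardCharsets.ISO_8859_1` this should return true for any input, which is exactly why it is useless as a validity check. With `StandardCharsets.UTF_8` it returns false for a sequence such as `0xC3 0x28`, where a lead byte is not followed by a continuation byte.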

So a method to determine whether a stream contained valid ISO-8859-1 could just return true - but that doesn't mean that the original text was encoded in ISO-8859-1... it may be meaningless to a human when decoded with ISO-8859-1, but still valid.

If you know that the original plain text won't include certain characters (e.g. unprintable control characters) you could detect that quite simply just by checking whether any byte in the stream was blacklisted. More advanced detection might check for unexpected patterns - but it becomes very heuristic, and may be tightly coupled to what the original source text is expected to be like.
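
A minimal sketch of such a blacklist check might look like the following. The choice of blacklisted bytes is an assumption made for this example (C0 controls other than tab, line feed and carriage return, plus DEL and the C1 range 0x80-0x9F); it would need adjusting to whatever the source text is expected to contain. The class and method names are hypothetical.

```java
final class Latin1Blacklist {
    /**
     * Assumed policy: tab, LF and CR are acceptable; every other C0 control,
     * DEL (0x7F) and the C1 controls (0x80-0x9F) are treated as suspicious.
     * The argument is an unsigned byte value in the range 0..255.
     */
    static boolean isBlacklisted(int b) {
        if (b == '\t' || b == '\n' || b == '\r') {
            return false;
        }
        return b < 0x20 || (b >= 0x7F && b <= 0x9F);
    }

    /** Returns false as soon as any blacklisted byte is found. */
    static boolean looksLikeLatin1Text(byte[] content) {
        for (byte raw : content) {
            // Java bytes are signed, so mask to get the 0..255 value.
            if (isBlacklisted(raw & 0xFF)) {
                return false;
            }
        }
        return true;
    }
}
```

A true result only means that nothing unexpected was found. It does not prove that the text was originally encoded in ISO-8859-1.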

Jon Skeet
  • 1
    You can go a bit further and check for non-printable characters (0-31) but even that won't take you much closer as the range of non-printable characters is shared with many other encodings. – biziclop Sep 11 '15 at 09:54
  • 1
    Another thing you can do is check whether there is any byte value >127. If there isn't, you're probably okay with ISO-8859-1 (or indeed any sensible encoding defined as the superset of ASCII, e.g. UTF-8). – biziclop Sep 11 '15 at 09:56
  • 2
    @biziclop: Other than EBCDIC, of course... and yes, if you were giving a confidence score, then "everything is between 32 and 127, or tab/line feed/carriage return" would give pretty high confidence. – Jon Skeet Sep 11 '15 at 10:07
  • @JonSkeet If I understand correctly from [wikipedia](https://en.wikipedia.org/wiki/ISO/IEC_8859-1) there is a range of bytes which does not correspond to any character in the ISO-8859-1 encoding, precisely between 127 and 159 inclusive (decimal format). – chess4ever Apr 04 '18 at 09:57
  • 1
    @chess4ever: There's a difference between "ISO-8859-1" and "ISO 8859-1", bizarrely enough. (Note the space-or-dash after ISO.) ISO 8859-1 has a gap; ISO-8859-1 doesn't. – Jon Skeet Apr 04 '18 at 09:59
3

ISO-8859-1, or Latin-1, is a single-byte encoding without much structure, or at least without any format that could be checked. It cannot easily be distinguished from other single-byte encodings.

However, the byte 0 will not generally occur in text, and its presence might point to a two-byte encoding like UTF-16LE or UTF-16BE.
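
One way to sketch that hint is to count zero bytes at even and odd offsets. This assumes mostly ASCII-range text with no byte order mark, where UTF-16LE puts the zero byte second in each pair and UTF-16BE puts it first. The class name, method name and returned labels are invented for this example, and the result is only a hint.

```java
final class NulByteHint {
    /**
     * Heuristic only: reports where zero bytes cluster. Assumes the text is
     * mostly in the ASCII range and carries no byte order mark.
     */
    static String guessFromNulBytes(byte[] content) {
        int even = 0;
        int odd = 0;
        for (int i = 0; i < content.length; i++) {
            if (content[i] == 0) {
                if ((i & 1) == 0) {
                    even++;
                } else {
                    odd++;
                }
            }
        }
        if (even == 0 && odd == 0) {
            return "no NUL bytes";
        }
        if (odd > even) {
            return "maybe UTF-16LE"; // low byte first, zero high byte second
        }
        if (even > odd) {
            return "maybe UTF-16BE"; // zero high byte first
        }
        return "inconclusive";
    }
}
```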

A multi-byte encoding like UTF-8, by contrast, is detectable because it follows a strict format.

ISO-8859-1 can be mistaken for Windows-1252 (Windows Latin-1). The characters in which the two differ might be identifiable statistically, as punctuation is involved.
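
The two encodings differ in the byte range 0x80-0x9F: in ISO-8859-1 these are C1 control characters, while Windows-1252 assigns most of them to printable characters such as the euro sign (0x80), curly double quotes (0x93, 0x94) and the en dash (0x96). A simple count of bytes in that range can therefore serve as a hint. The sketch below uses hypothetical names and leaves the interpretation of the count to the caller.

```java
final class Cp1252Hint {
    /**
     * Counts bytes in 0x80-0x9F. In ISO-8859-1 these are C1 controls, which
     * are rare in real text; in Windows-1252 most are punctuation. A non-zero
     * count suggests Windows-1252 rather than ISO-8859-1, but proves nothing.
     */
    static int countC1RangeBytes(byte[] content) {
        int count = 0;
        for (byte raw : content) {
            int b = raw & 0xFF; // unsigned value of the signed Java byte
            if (b >= 0x80 && b <= 0x9F) {
                count++;
            }
        }
        return count;
    }
}
```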

EBCDIC, a single-byte encoding, is quite different.

What helps for ISO-8859-* encodings is having lists of frequent words for each language and encoding, and identifying the language plus encoding by best match.

There are some language recognizers around.

Joop Eggen