
As in the title, Jackson can't read UTF-8.

Line 37:

ArrayNode arrayNode1 = objectMapper.readValue(bansFile, ArrayNode.class);

21:48:55 [SEVERE] com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 start byte 0xb3 at [Source: (File); line: 18, column: 38]

Here is line 18 of the file; the parser fails on the character "ł":

"reason" : "Administrator nie podał powodu banicji"

Full stack trace:

21:48:55 [SEVERE]     at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1840)
21:48:55 [SEVERE]     at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:712)
21:48:55 [SEVERE]     at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidInitial(UTF8StreamJsonParser.java:3569)
21:48:55 [SEVERE]     at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidChar(UTF8StreamJsonParser.java:3565)
21:48:55 [SEVERE]     at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishString2(UTF8StreamJsonParser.java:2511)
21:48:55 [SEVERE]     at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishAndReturnString(UTF8StreamJsonParser.java:2437)
21:48:55 [SEVERE]     at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:293)
21:48:55 [SEVERE]     at com.fasterxml.jackson.databind.deser.std.BaseNodeDeserializer.deserializeObject(JsonNodeDeserializer.java:267)
21:48:55 [SEVERE]     at com.fasterxml.jackson.databind.deser.std.BaseNodeDeserializer.deserializeArray(JsonNodeDeserializer.java:437)
21:48:55 [SEVERE]     at com.fasterxml.jackson.databind.deser.std.JsonNodeDeserializer$ArrayDeserializer.deserialize(JsonNodeDeserializer.java:141)
21:48:55 [SEVERE]     at com.fasterxml.jackson.databind.deser.std.JsonNodeDeserializer$ArrayDeserializer.deserialize(JsonNodeDeserializer.java:126)
21:48:55 [SEVERE]     at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:4202)
21:48:55 [SEVERE]     at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3070)
21:48:55 [SEVERE]     at koral.proxyban.listeners.ServerConnect.isBanned(ServerConnect.java:37)
21:48:55 [SEVERE]     at koral.proxyban.listeners.ServerConnect.onProxyConnect(ServerConnect.java:25)
J.F.
korallo

2 Answers


No, the error message is saying that the data is not UTF-8.

It looks to be ISO-LATIN-2 (or equivalent) based on the fact that the offending character is ł encoded as byte 0xb3.

Your choices depend on many things. If your data is coming from an outside source you may have no say in the encoding (or you may contact the data supplier and ask them to provide the data in UTF-8). Then you would have to do something like this:

BufferedReader br = new BufferedReader(new InputStreamReader(
               new FileInputStream("yourfile"), "ISO-8859-2");    
objectMapper.readValue(br, ArrayNode.class);

In this case the InputStreamReader will correctly convert the bytes to chars, and Jackson won't have to deal with bytes at all (just text). But it also requires you to know that the file is encoded using ISO-8859-2 (i.e. Latin-2).

There are ways to guess a file's encoding, but it cannot be done reliably, so you can't programmatically say "open the file in the correct encoding". The way I debugged this problem was to look up common Polish encodings, then see in which of them ł is encoded as the byte 0xb3 from the error message.
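That lookup can be reproduced in code. A minimal sketch (the single byte `0xB3` stands in for the offending position in the file): decoding it as ISO-8859-2 yields `ł`, while lenient UTF-8 decoding yields the replacement character instead of failing like Jackson's strict parser does:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingCheck {
    public static void main(String[] args) {
        byte[] data = { (byte) 0xB3 };

        // In ISO-8859-2 (Latin-2), 0xB3 is the letter 'ł'.
        String latin2 = new String(data, Charset.forName("ISO-8859-2"));
        System.out.println(latin2); // ł

        // In UTF-8, a lone 0xB3 is an invalid start byte; the lenient
        // String constructor substitutes the replacement char U+FFFD.
        String utf8 = new String(data, StandardCharsets.UTF_8);
        System.out.println((int) utf8.charAt(0)); // 65533
    }
}
```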

Unfortunately there are many methods in the API that use the "default platform encoding", which is not always UTF-8. So you may end up with a file that you *think* is UTF-8, but isn't, because you forgot to explicitly request UTF-8, e.g. with `new OutputStreamWriter(new FileOutputStream("yourfile"), StandardCharsets.UTF_8);`.

This applies to all places where bytes are converted to characters and vice versa: file access, reading text from a network socket, and so on.
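As a sketch of that pitfall (file names are made up here): a plain `FileWriter` before Java 18 silently picks up the platform default encoding, while passing an explicit charset removes the ambiguity:

```java
import java.io.FileOutputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetPitfall {
    public static void main(String[] args) throws IOException {
        String json = "{\"reason\":\"Administrator nie podał powodu banicji\"}";

        // Risky: FileWriter (before Java 18) uses the platform default
        // encoding, which may be e.g. windows-1250 on a Polish system.
        try (FileWriter w = new FileWriter("risky.json")) {
            w.write(json);
        }

        // Safe: the encoding is stated explicitly, so 'ł' is always
        // written as the UTF-8 byte pair 0xC5 0x82.
        try (OutputStreamWriter w = new OutputStreamWriter(
                new FileOutputStream("safe.json"), StandardCharsets.UTF_8)) {
            w.write(json);
        }
    }
}
```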

Kayaman
  • You are right. I manually changed the file's encoding to UTF-8 and now it reads fine. So the problem was that the file was created in ANSI or something like that. ``` bansFile = new File(ProxyServer.getInstance().getPluginsFolder(), "/ProxyBan/bans.json"); ``` – korallo Jan 26 '21 at 21:41

The problem isn't related to Jackson itself, because the encodings accepted for JSON are UTF-8, UTF-16 and UTF-32.

If you write the file, you can save it using

OutputStreamWriter writer = new OutputStreamWriter(
                  new FileOutputStream("yourfile"), StandardCharsets.UTF_8);

If the file is created by other sources, you must read it with the correct encoding:

BufferedReader br = new BufferedReader(new InputStreamReader(
                   new FileInputStream("yourfile"), SOME_CHARSET));

and then save the contents in UTF-8; otherwise Jackson will not accept it.
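For instance, a one-off conversion could look like this (the file name and source charset are assumptions; in the asker's case the source appears to be ISO-8859-2):

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Transcode {
    public static void main(String[] args) throws IOException {
        Path source = Path.of("bans.json");                    // hypothetical input file
        Charset sourceCharset = Charset.forName("ISO-8859-2"); // assumed source encoding

        // Decode with the original charset, then re-encode as UTF-8.
        String text = Files.readString(source, sourceCharset);
        Files.writeString(source, text, StandardCharsets.UTF_8);
    }
}
```

Reading the whole file into memory keeps the example short; for large files you would stream through a `Reader`/`Writer` pair instead.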

  • No, if the file is from another source, you must read it with the encoding it was written in which may not be UTF-8 (although that's less and less common these days). If you expect it to be in UTF-8 and it's not in UTF-8, then you must do something so your expectations are met, such as tell your data provider to send the data in the correct encoding, or find out the encoding and do something about it yourself. – Kayaman Jan 27 '21 at 09:28
  • No, UTF8 can map ASCII, which is bytes `0-127`. UTF8 is not compatible with extended charsets like Latin-1 which use 1 byte for a character that requires multiple bytes in UTF8 (which is the crux of this question), so your answer or your comment is just not correct in any way. – Kayaman Jan 27 '21 at 10:45
  • I'm sorry, I've just answered so many encoding questions that I can't allow people to spread incorrect information, especially when the asker marks it as "correct" when they don't understand how encodings work themselves. It's the only way to stop the spread of misinformation. – Kayaman Jan 27 '21 at 10:49
  • I'm positive. One of the design decisions of UTF8 was that it was binary compatible with 7 bit ASCII. Once you go over 7 bits, UTF8 becomes a [multi-byte character set](https://en.wikipedia.org/wiki/UTF-8#Encoding), while other encodings use the values `128-255` to encode other characters. When decoding it can be determined that the encoding isn't actually UTF8, so you get an error (or replacement character) instead of just garbled data. Whereas if you decode with a single-byte encoding, all bytes are valid so you get [garbled data](https://en.wikipedia.org/wiki/Mojibake). – Kayaman Jan 27 '21 at 11:42
  • See also https://stackoverflow.com/questions/29667977/converting-string-from-one-charset-to-another/39308860#39308860 – Kayaman Jan 27 '21 at 11:43
  • Latin-2 encodes `ł` to *one byte* `0xB3`. UTF-8 encodes `ł` to *two bytes* `0xC5 0x82`. Obviously those two are not compatible. It's like you asking me "Can you not say hello?" Yes, I can say hello, but when I say hello, it sounds like "moikka", not like "ciao". So if you're reading my letter written in Finnish, you better understand it's not Italian. The same meaning has a different word in a different language, just like the same character has different bytes in a different encoding. – Kayaman Jan 27 '21 at 11:56
  • This gets less and less relevant as non-UTF encodings are pretty much legacy, but that's why people trip up on them so often (and somehow bruteforce their problem away without understanding what the problem really was). Here's a decent explanation https://kunststube.net/encoding/ read Spolsky's original article for the in-depth analysis. It's unfortunately too complex for something quite irrelevant these days. – Kayaman Jan 27 '21 at 12:15
  • OK, re-reading my answer I understand where I was wrong. When reading, you must pass the correct charset, and then save the data as UTF-8. I updated the answer. – Federico Paparoni Jan 27 '21 at 13:24
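The byte-level difference from the comments above (Latin-2 encodes `ł` as the single byte `0xB3`, UTF-8 as the pair `0xC5 0x82`) can be verified directly:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LStrokeBytes {
    public static void main(String[] args) {
        String s = "ł"; // U+0142, LATIN SMALL LETTER L WITH STROKE

        // One byte in ISO-8859-2: 0xB3 (printed as the signed value -77).
        System.out.println(Arrays.toString(s.getBytes(Charset.forName("ISO-8859-2")))); // [-77]

        // Two bytes in UTF-8: 0xC5 0x82 (signed values -59, -126).
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_8)));        // [-59, -126]
    }
}
```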