I'm facing the following situation:

We poll some CSV data from an external source. The source's response headers don't specify the charset, and the data contains some German characters that show up as a question mark inside a rhombus (as far as I know, that means the bytes could not be decoded as UTF-8).

We want to do some work with this data and then forward it. To fix this issue, we also want to re-encode the erroneous characters into a correct format so that they display properly.

I have already read some answers here, and most of them suggest using the string.getBytes("encoding") method and then creating a new string from the result with some other encoding.

But as I understand it, I need something different: that approach just takes the same bytes and interprets them in another encoding. Some characters have a different byte length in UTF-8 than in, for example, ISO-8859-1 (which I believe is the real encoding of the data we poll), so strange characters appear in the resulting string. That is not what we want to achieve.

I would need something which can

  1. Get the character from a byte representation in a source encoding
  2. Get the character from a byte representation in the target encoding
  3. Iterate over the decoded byte array and replace each character's byte representation with its representation in the target encoding

After this it would be safe to create a new string from the byte array with the target encoding. Does anyone know a good library that can do this? I don't want to implement it myself if it already exists.

Vendel Serke
  • Be careful what you read. Most of the answers here give you absolute garbage advice, as the people who wrote them don't understand how character encoding works (they think they do, unfortunately). Your main problem is identifying the encoding; everything else is a piece of cake. However, identifying an encoding isn't necessarily easy, at least if you have a lot of different options. – Kayaman Nov 16 '17 at 12:34
  • Maybe this can help you (and yes, I have no idea about encoding...) http://jchardet.sourceforge.net/ – canillas Nov 16 '17 at 12:35

1 Answer

You have bytes, binary data, that represent text in some character set. To find out which one, you need charset detection. Once you know the Charset, you can load the bytes into a Java String (which is Unicode) and save it as bytes in any Charset you want.
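
As a minimal sketch of that decode-then-encode step, assuming the source really is ISO-8859-1 (the asker's guess, not a verified fact) and the target is UTF-8. The class and method names here are hypothetical:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Recode {
    /** Decodes raw bytes with the source charset, then encodes the text with the target charset. */
    static byte[] recode(byte[] raw, Charset source, Charset target) {
        String text = new String(raw, source); // bytes -> Unicode text
        return text.getBytes(target);          // Unicode text -> bytes in the target charset
    }

    public static void main(String[] args) {
        // "Grüße" as ISO-8859-1 bytes: ü is 0xFC, ß is 0xDF (one byte each)
        byte[] latin1 = {'G', 'r', (byte) 0xFC, (byte) 0xDF, 'e'};
        byte[] utf8 = recode(latin1, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        // In UTF-8 both ü and ß take two bytes, so 5 bytes become 7
        System.out.println(latin1.length + " -> " + utf8.length);
    }
}
```

No per-character byte replacement is needed: the String is the charset-independent middle step, so the different byte lengths are handled by the encoder.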

If the target Charset cannot represent a Unicode symbol (code point), you can even determine how that case is handled. See CharsetDecoder/CharsetEncoder.
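
A sketch of how that handling can be chosen with CharsetEncoder, assuming ISO-8859-1 as the target purely for illustration. The class and method names are hypothetical; CodingErrorAction offers REPORT (throw an exception), REPLACE (substitute the encoder's replacement bytes), and IGNORE (drop the character):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Latin1Encoder {
    /** Encodes text as ISO-8859-1, applying the given action to characters the charset cannot represent. */
    static byte[] encode(String text, CodingErrorAction onUnmappable) throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(onUnmappable);
        ByteBuffer buffer = encoder.encode(CharBuffer.wrap(text));
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes); // copy only the bytes that were actually written
        return bytes;
    }

    public static void main(String[] args) throws CharacterCodingException {
        // German characters exist in ISO-8859-1, so this succeeds
        System.out.println(encode("Gr\u00FC\u00DFe", CodingErrorAction.REPORT).length);
        try {
            // U+010D (Czech c with caron) is not in ISO-8859-1
            encode("\u010D", CodingErrorAction.REPORT);
        } catch (CharacterCodingException e) {
            System.out.println("unmappable character reported");
        }
    }
}
```

Using REPORT makes data loss visible, whereas String.getBytes(Charset) silently substitutes a replacement byte.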

Some libraries exist for Charset detection. I wrote my own for a partial set of charsets and languages. Detection works best in combination with language detection, for instance for Czech.
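
To illustrate the idea only, here is a very crude heuristic. It is not my library and not a real detector: it assumes the data is either valid UTF-8 or, failing that, ISO-8859-1 (the fallback is an assumption taken from the question). The class and method names are hypothetical:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class CharsetGuess {
    /** Returns UTF-8 if the bytes decode cleanly as UTF-8, otherwise falls back to ISO-8859-1. */
    static Charset guess(byte[] raw) {
        CharsetDecoder strictUtf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            strictUtf8.decode(ByteBuffer.wrap(raw)); // throws on any invalid byte sequence
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            // Every byte sequence is valid ISO-8859-1, so this fallback never fails,
            // but it may still be the wrong guess (e.g. for windows-1252 or ISO-8859-2).
            return StandardCharsets.ISO_8859_1;
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = {'G', 'r', (byte) 0xFC, (byte) 0xDF, 'e'};
        System.out.println(guess(latin1));
        System.out.println(guess("Gr\u00FC\u00DFe".getBytes(StandardCharsets.UTF_8)));
    }
}
```

This works reasonably well for telling UTF-8 from a single known legacy charset, because non-ASCII ISO-8859-1 text rarely forms valid UTF-8 sequences. It cannot tell legacy charsets apart from each other; that is where statistical or language-based detection comes in.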

See "What is the most accurate encoding detector?"

Joop Eggen