I'm facing the following situation:
We poll some CSV data from an external source. The source's response headers don't specify the charset, and the data contains some German characters that show up as a question mark inside a rhombus (as far as I know, that means the bytes could not be decoded as UTF-8).
We want to do some work with this data and then forward it, but to fix this issue we also want to convert the erroneous characters to a correct encoding so that they display properly.
I have already read some answers here, and most of them suggest calling "string.getBytes("encoding")" and then creating a new string from the result with some other encoding.
But from what I understand, I need something different: this method just encodes the characters back to bytes, and the same bytes are then interpreted with respect to another encoding. Some characters are represented with a different number of bytes in UTF-8 than in, for example, ISO-8859-1 (which I believe the data we are polling is really encoded in), so strange characters appear in the result string. That is not what we want to achieve.
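To illustrate what I mean, here is a minimal sketch. The sample text "Grüße" and the class and method names are made up for the example (they are not our real data or code), and it assumes the payload is ISO-8859-1 but gets decoded as UTF-8 somewhere along the way:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class WrongCharsetDemo {
    // Hypothetical helper: decodes the raw bytes as UTF-8, which is what
    // seems to happen when the response headers carry no charset.
    static String decodeAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "Grüße" as ISO-8859-1 bytes: ü and ß take one byte each.
        byte[] raw = "Gr\u00FC\u00DFe".getBytes(StandardCharsets.ISO_8859_1);

        // Those single bytes are not valid UTF-8, so the decoder substitutes
        // the replacement character U+FFFD for them.
        String wrong = decodeAsUtf8(raw);
        System.out.println("contains U+FFFD: " + (wrong.indexOf('\uFFFD') >= 0));

        // Calling getBytes(...) on the already-decoded string does not bring
        // the original bytes back: the information was lost during decoding.
        byte[] back = wrong.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("original bytes recovered: " + Arrays.equals(raw, back));
    }
}
```

So once the string has been created with the wrong charset, re-encoding it with getBytes and building a new string does not seem to help.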
I need something that can:
- Get the character from its byte representation in the source encoding
- Get the byte representation of that character in the target encoding
- Iterate over the byte array and replace each character's byte representation with its representation in the target encoding
After this it would be safe to create a new string from the byte array with the target encoding. Does anyone know a good library that can do this? I don't want to implement it myself if it already exists.
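For reference, this is roughly how I picture the steps listed above, written as a minimal sketch with plain JDK classes. It assumes that the raw, undecoded bytes of the response are still available and that they really are ISO-8859-1, which I have not confirmed. The class name "Transcoder" and the method "transcode" are hypothetical:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Transcoder {
    // Hypothetical helper: decode the raw bytes with the source charset,
    // then encode the resulting characters with the target charset.
    static byte[] transcode(byte[] raw, Charset source, Charset target) {
        String text = new String(raw, source); // bytes -> characters
        return text.getBytes(target);          // characters -> bytes
    }

    public static void main(String[] args) {
        // "Grüße" in ISO-8859-1: 0xFC is ü, 0xDF is ß.
        byte[] latin1 = {0x47, 0x72, (byte) 0xFC, (byte) 0xDF, 0x65};

        byte[] utf8 = transcode(latin1,
                StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);

        // ü and ß need two bytes each in UTF-8, so the array grows from 5 to 7.
        System.out.println(latin1.length + " bytes in, " + utf8.length + " bytes out");
    }
}
```

If something like this is already the right approach, or if a library handles the unknown-charset case better, that is what I would like to know.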