I'm dealing with an external web service that is giving me incorrectly encoded (and/or corrupted) UTF-8 Strings. They were most likely originally either ISO Latin or Windows-1252 but are now UTF-8 (and/or a mixture of ISO/Windows/UTF-8). Lovely A hats (Â) abound.
I obviously cannot fix how the external web service stores its strings, so the information is lost. Thus I know that a 100% accurate translation is not possible.
But I was hoping that someone had written a heuristic character mapping library in Java (it's unlikely that someone would type A hats on purpose).
If not, I guess I can port this guy's PHP code: https://stackoverflow.com/a/3521340/318174
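To make the kind of heuristic I have in mind concrete, here is a minimal sketch. It is not a port of the linked PHP code, and the class and method names (`MojibakeRepair`, `repair`) are made up for illustration. It assumes the whole string went through exactly one wrong decode (UTF-8 bytes read as Windows-1252), so it would leave a string with mixed encodings untouched:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class MojibakeRepair {
    private static final Charset CP1252 = Charset.forName("windows-1252");

    /**
     * Hypothetical helper: undoes one "UTF-8 bytes decoded as Windows-1252" step.
     * Returns the input unchanged whenever the reversal is not clean.
     */
    static String repair(String s) {
        try {
            // Step 1: turn the characters back into the bytes they were
            // (presumably) decoded from. Fails if a char is not in Windows-1252.
            ByteBuffer bytes = CP1252.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(s));
            // Step 2: strictly decode those bytes as UTF-8. Fails if the
            // bytes are not well-formed UTF-8, i.e. the string was probably fine.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(bytes)
                    .toString();
        } catch (CharacterCodingException e) {
            // Not reversible as a whole: assume the string is not mojibake.
            return s;
        }
    }

    public static void main(String[] args) {
        System.out.println(repair("caf\u00c3\u00a9")); // mojibake input "cafÃ©"
        System.out.println(repair("caf\u00e9"));       // already-correct "café"
    }
}
```

The strict (`REPORT`) error actions are the heuristic part: a correctly encoded string containing accented characters almost never happens to re-encode into well-formed UTF-8, so it falls through unchanged. A text that legitimately contains a sequence like "Ã©" would still be "repaired" wrongly.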
UPDATE and explanation: A simple conversion like the one @VGR answered with will not work. I do not have the original bytes. The data was converted incorrectly at the endpoint (maybe the SOAP server called getBytes(/* without the correct encoding */), or maybe the data is stored in the incorrect format). When you convert bytes to Strings and back again in Java, the data is not retained unless the encoding is the same everywhere. This is easy to understand if you think of something like ASCII <-> UTF-8. With Windows-1252 or ISO Latin it's much more complicated because the data is often not lost but confused. That is because those encodings are not a subset of UTF-8: a character above 0x7F is a single byte in Windows-1252 or ISO Latin but takes two or three bytes in UTF-8.
If you don't believe me, you can try calling getBytes() back and forth with various encodings, and you will see data corruption and data loss.
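A small experiment along those lines, as a sketch (the class name `RoundTripDemo` and helper `roundTrip` are just for illustration). It encodes a string with one charset and decodes it with another, which is my guess at what happens on the server side:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RoundTripDemo {
    /** Encodes s with one charset, then decodes the bytes with another. */
    static String roundTrip(String s, Charset encodeWith, Charset decodeWith) {
        return new String(s.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // "café", escaped to avoid source-encoding issues

        // Confusion: the two UTF-8 bytes of é (C3 A9) are read as two
        // ISO-8859-1 characters, giving "cafÃ©". The bytes are still recoverable.
        String confused = roundTrip(original,
                StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        // Loss: the single ISO-8859-1 byte E9 is not valid UTF-8 on its own,
        // so the decoder substitutes U+FFFD and the original byte is gone.
        String lost = roundTrip(original,
                StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);

        // Loss: é has no ASCII representation, so getBytes emits '?'.
        String asciiLost = roundTrip(original,
                StandardCharsets.US_ASCII, StandardCharsets.UTF_8);

        System.out.println(confused);
        System.out.println(lost.equals("caf\uFFFD"));
        System.out.println(asciiLost);
    }
}
```

The first case is the one that produces the A-hat style garbage and can in principle be reversed; the other two destroy information, which is why no heuristic can recover them.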