Java library to fix incorrectly encoded text using heuristics

Question

I'm dealing with an external web service that is giving me incorrectly encoded (and/or corrupted) UTF-8 Strings that were most likely originally ISO Latin or Windows-1252 but are now UTF-8 (and/or a mixture of ISO/Windows/UTF-8). Lovely A hats (Â) abound.

I obviously cannot fix how the external web service stores its strings, so the information is lost. I know that a 100% accurate translation is not possible.

But I was hoping that someone had written a heuristic character-mapping library in Java (it's unlikely someone would type A hats on purpose).

If not, I guess I can port this guy's PHP code: https://stackoverflow.com/a/3521340/318174

UPDATE and explanation: A simple conversion like the one @VGR answered with will not work. I do not have the original bytes. The data was converted incorrectly at the endpoint (maybe the SOAP server called getBytes() without the correct encoding, or maybe the data is stored in the incorrect format). When you convert bytes to Strings and back in Java, the data is not retained unless the encoding is the same everywhere. This is easy to understand if you think of something like ASCII <-> UTF-8. With Windows-1252 or ISO Latin it's much more complicated, because data is not so much lost as confused. That is because every byte above 0x7F in those single-byte encodings becomes a two-byte sequence in UTF-8, so they are not a subset of UTF-8.

If you don't believe me, try calling getBytes() back and forth with various encodings and you will see data corruption and data loss.
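A round trip of that kind can be tried in a few lines. This is only a minimal sketch: the class name `RoundTripDemo`, the helper `transcode`, and the sample string are made up for illustration and say nothing about what the web service actually does.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class RoundTripDemo {
    // Encode with one charset and decode with another, as a misconfigured endpoint might.
    static String transcode(String s, Charset encodeWith, Charset decodeWith) {
        return new String(s.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        String original = "caf\u00e9"; // "café"

        // UTF-8 bytes read as windows-1252: the text is confused ("cafÃ©"),
        // but both bytes survive, so this particular case is still recoverable.
        String confused = transcode(original, StandardCharsets.UTF_8, cp1252);
        System.out.println(confused);

        // US-ASCII cannot represent 'é', so getBytes substitutes '?': the data is gone.
        String lost = transcode(original, StandardCharsets.US_ASCII, StandardCharsets.UTF_8);
        System.out.println(lost);
    }
}
```

What the console shows depends on its own encoding, so comparing the resulting Strings in code is more reliable than eyeballing the printed output.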

I shouldn't let it bother me, but it always annoys me when someone votes to close without writing a comment. – Adam Gent Dec 15 '12 at 02:03

VGR · Answer 1 · 2012-12-15T12:23:00.937

0

I may be misunderstanding the nature of the incorrectly encoded data, but that PHP code seems like overkill to me. If you have UTF-8 bytes that were passed as individual characters, you should be able to just do:

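import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Turn the mis-decoded characters back into windows-1252 bytes, then decode those bytes as UTF-8.
String fix(String s) {
    byte[] bytes = s.getBytes(Charset.forName("windows-1252"));
    return new String(bytes, StandardCharsets.UTF_8);
}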
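One caveat with the method above is that it silently substitutes `?` or U+FFFD when the input is not actually mis-decoded UTF-8. A more cautious variant, sketched here under the assumption that the whole string is either entirely mis-decoded or entirely fine, uses strict encoders and decoders and returns the input untouched when either step fails. The names `GuardedFix` and `fixIfMojibake` are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class GuardedFix {
    private static final Charset CP1252 = Charset.forName("windows-1252");

    // Returns the repaired text, or the input unchanged if it does not look like
    // UTF-8 that was mis-decoded as windows-1252.
    static String fixIfMojibake(String s) {
        try {
            // Strict encode: fails on characters windows-1252 cannot represent.
            ByteBuffer bytes = CP1252.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(s));
            // Strict decode: fails if the bytes are not well-formed UTF-8.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(bytes)
                    .toString();
        } catch (CharacterCodingException e) {
            // Leave the text alone rather than replacing characters with '?' or U+FFFD.
            return s;
        }
    }
}
```

With this sketch, `fixIfMojibake("cafÃ©")` yields `"café"`, while an already correct `"café"` comes back unchanged because a lone 0xE9 byte is not valid UTF-8. It does not handle strings that mix correct and mis-decoded text.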

edited Dec 15 '12 at 12:23

answered Dec 15 '12 at 01:14

VGR


That does not work because the data is already corrupt. If I had the original bytes then that would work. Believe me, what you have listed is something I am very aware of. – Adam Gent Dec 15 '12 at 02:00
@AdamGent This is what the PHP code does... though it should use Windows-1252 instead of ISO-8859-1. Can you show an example of what you have and what it's supposed to be? – Esailija Dec 15 '12 at 06:31
You're right; code updated. I was thinking that all UTF-8 bytes are also valid ISO-8859-1 characters, but that's not the case. – VGR Dec 15 '12 at 12:24
@Esailija The above is not at all what the PHP code is doing. The PHP code replaces character bytes based on some heuristic mappings. It does this because it assumes a mixture of 2-byte Latin/Windows code with potential 4-byte UTF-8, which is sort of my problem. The Java code above takes the bytes, turns them into Unicode (which is not UTF-8), and then goes from Unicode to whatever character encoding you desire, based on a giant mapping table. – Adam Gent Dec 15 '12 at 15:04
I'm not clear on why this isn't sufficient. If you know the encoding in which the original mis-encoded bytes were supposed to be encoded, it is completely safe and reliable to decode them using a known charset. Unlike calling String.getBytes with no argument, calling String.getBytes with an argument is far more reliable and is guaranteed to produce predictable results. No information will be lost. Is the problem that your service has no way to know what the original encoding was supposed to be? – VGR Dec 16 '12 at 13:37
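For the mixed case raised in the comments above, where some parts of a string are mis-decoded and others are fine, one possible heuristic is to repair only those runs of characters that map back to a well-formed multi-byte UTF-8 sequence. The following is a rough sketch of that idea, not a port of the linked PHP code; `MixedFix`, `fixMixed`, and the helper names are hypothetical, and the approach assumes the damage came from reading UTF-8 bytes as windows-1252 exactly once.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

final class MixedFix {
    // Reverse table: character -> the windows-1252 byte (0x80..0xFF) it is decoded from.
    private static final Map<Character, Integer> TO_BYTE = new HashMap<>();
    static {
        Charset cp1252 = Charset.forName("windows-1252");
        for (int b = 0x80; b <= 0xFF; b++) {
            char c = new String(new byte[] {(byte) b}, cp1252).charAt(0);
            if (c != '\uFFFD') TO_BYTE.put(c, b); // skip bytes the charset leaves undefined
        }
    }

    static String fixMixed(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            String repaired = tryDecodeAt(s, i);
            if (repaired != null) {
                out.append(repaired);
                i += sequenceLength(TO_BYTE.get(s.charAt(i)));
            } else {
                out.append(s.charAt(i++)); // not mojibake: copy the character as-is
            }
        }
        return out.toString();
    }

    // Length of the UTF-8 sequence introduced by this lead byte, or 0 if it is not a lead byte.
    private static int sequenceLength(Integer b) {
        if (b == null) return 0;
        if (b >= 0xC2 && b <= 0xDF) return 2;
        if (b >= 0xE0 && b <= 0xEF) return 3;
        if (b >= 0xF0 && b <= 0xF4) return 4;
        return 0;
    }

    // Decodes the run starting at 'start' if it maps to one well-formed UTF-8 sequence, else null.
    private static String tryDecodeAt(String s, int start) {
        int len = sequenceLength(TO_BYTE.get(s.charAt(start)));
        if (len == 0 || start + len > s.length()) return null;
        byte[] bytes = new byte[len];
        for (int k = 0; k < len; k++) {
            Integer b = TO_BYTE.get(s.charAt(start + k));
            // Every character after the first must map to a continuation byte (0x80..0xBF).
            if (b == null || (k > 0 && b > 0xBF)) return null;
            bytes[k] = b.byteValue();
        }
        try {
            // newDecoder() reports malformed input by default, so invalid sequences throw.
            return StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }
}
```

For example, `fixMixed("cafÃ© vs café")` returns `"café vs café"`: the first word is repaired and the second is left alone. Like any heuristic of this kind it can produce false positives, since text that legitimately contains a sequence such as "Ã©" would also be rewritten.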
