Why is this Java encoding UTF-8 --> Latin1 wrong?

Question

I want to download this UTF-8 file and convert it to Latin1 in Java (Android). At line 443, Frango-dâ~@~YÃ¡gua-menor is translated to Frango-d?água-menor instead of Frango-d'água-menor. Same in line 465, where DescriÃ§Ã£o fÃsicaâ~@¦ is translated to Descrição física?, with that pesky ? at the end.

It seems this file is not valid UTF-8? But iconv -f utf-8 -t iso-8859-1//TRANSLIT on this file works just fine.

This is the code I use to convert it (the downloaded file is in infofile):

                fos = new FileOutputStream(infotxt);
                out = new OutputStreamWriter(fos, "Latin1");
                fis = new FileInputStream(infofile);
                br = new BufferedReader(new InputStreamReader(fis));
                while ((line = br.readLine()) != null) {
                    out.write("\n"+line.trim());
                }
                br.close();
                out.close();
                fis.close();
                fos.close();

`?` usually means that either the program you are writing with does not know how to convert a certain character, or the program you use to display it does not know what to print. You probably have to do some special-case handling or try another converter. `iconv` probably has a better mapping table between the two encodings. — leonardkraemer, Jan 02 '19 at 16:56
The `?` appears in the (correctly written Latin1) downloaded file. — Luis A. Florit, Jan 02 '19 at 17:00
Then the first part of my comment is the answer. `OutputStreamWriter` has no mapping for the specific character from `UTF-8` to `Latin1`. see https://stackoverflow.com/questions/652161/how-do-i-convert-between-iso-8859-1-and-utf-8-in-java — leonardkraemer, Jan 02 '19 at 17:04
Exactly. But then why did the `TRANSLIT` in `iconv` do a perfect job, and how can I simulate that in Java? Maybe something like this: https://stackoverflow.com/a/5807419/1483390 — Luis A. Florit, Jan 02 '19 at 17:16
I guess you will have to fumble with [CharsetEncoder](https://docs.oracle.com/javase/7/docs/api/java/nio/charset/CharsetEncoder.html) or any other solution. You could run [iconv to android with ndk](https://stackoverflow.com/questions/10004077/how-to-install-libiconv-for-android-ndk), but that will give you even more problems. — leonardkraemer, Jan 02 '19 at 17:23

score 3 · Answer 1 · answered Jan 02 '19 at 18:00

The file you linked is a UTF-8 encoded HTML file, and it uses characters outside of the Latin-1 character set. For example, instead of the plain apostrophe that you expect (Frango-d'água-menor, using code U+0027) it uses the similar-looking Right Single Quotation Mark U+2019 (Frango-d’água-menor). That character isn't part of the Latin-1 set, so the writer substitutes a replacement question mark.
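To see which characters trigger the replacement, something like the following could be used. It is only a sketch: the class and method names (`Latin1Check`, `firstUnmappable`) are made up for illustration, and it relies on `CharsetEncoder.canEncode(char)` from `java.nio.charset`.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class Latin1Check {
    // Hypothetical helper: index of the first char that ISO-8859-1 cannot
    // encode, or -1 if the whole string fits into Latin-1.
    static int firstUnmappable(String s) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        for (int i = 0; i < s.length(); i++) {
            if (!encoder.canEncode(s.charAt(i))) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Same text as the problem line, written with escapes for clarity.
        String line = "Frango-d\u2019\u00e1gua-menor";
        int i = firstUnmappable(line);
        if (i >= 0) {
            System.out.printf("U+%04X at index %d is not in Latin-1%n",
                    (int) line.charAt(i), i);
        }
    }
}
```

Running a check like this over each `line` before writing it would list the characters that end up as `?` in the output file.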

As Latin-1 can't encode the whole Unicode character set, you have to accept things like that.

Your best chance is to identify the problem characters and do a string replacement before writing to the limited Latin-1 set.
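A minimal sketch of that idea, assuming only a few typographic characters occur in the input. The helper name `toLatin1Friendly` and the chosen replacements are illustrative assumptions, not a complete or authoritative list:

```java
import java.nio.charset.StandardCharsets;

public class Latin1Replace {
    // Hypothetical helper: swaps a few common typographic characters for
    // look-alikes that exist in Latin-1. Extend the list as needed.
    static String toLatin1Friendly(String line) {
        return line
                .replace('\u2018', '\'')   // left single quotation mark
                .replace('\u2019', '\'')   // right single quotation mark
                .replace('\u201C', '"')    // left double quotation mark
                .replace('\u201D', '"')    // right double quotation mark
                .replace("\u2026", "..."); // horizontal ellipsis
    }

    public static void main(String[] args) {
        String line = "Frango-d\u2019\u00e1gua-menor";
        // After the replacement every char is encodable, so no '?' is produced.
        byte[] latin1 = toLatin1Friendly(line).getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1));
    }
}
```

In the code from the question, this would amount to calling the helper on `line` before passing it to `out.write(...)`.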

answered Jan 02 '19 at 18:00

Ralf Kleberhoff

Do you know how to make a "reasonably" safe replacement, like the iconv TRANSLIT does, at the `line` variable in my code? – Luis A. Florit Jan 02 '19 at 18:36
I meant, `line.replaceAll("[\\u2018\\u2019]", "'")` works for the quotation mark. Maybe there is some 'safe', relatively small list like this? I don't need to implement everything here: http://git.savannah.gnu.org/cgit/libiconv.git/tree/lib/translit.def – Luis A. Florit Jan 02 '19 at 19:13
Only you can decide which replacements you need (which special Unicode characters you expect in the input). You have a good (?) reference source of replacements. If you need just a handful, go for multiple `line.replaceAll()`. If you need more, I'd build a replacement array indexed by the character code, containing the replacement string, maybe initialized from translit.def or similar. – Ralf Kleberhoff Jan 02 '19 at 19:27
Yes, that is what I did, multiple `line.replaceAll()` after looking at the `translit.def`. I guess only a handful may appear. Thanks!! – Luis A. Florit Jan 02 '19 at 19:30
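
For a larger set of replacements, the table-based variant mentioned in the comments could look roughly like this. It is a sketch under assumptions: the class name `Latin1Translit` is made up, the table entries are a hand-picked illustration rather than a copy of `translit.def`, and a `Map` stands in for the array indexed by character code.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class Latin1Translit {
    // Illustrative replacement table; could instead be loaded from a file.
    private static final Map<Character, String> TABLE = new HashMap<>();
    static {
        TABLE.put('\u2018', "'");   // left single quotation mark
        TABLE.put('\u2019', "'");   // right single quotation mark
        TABLE.put('\u201C', "\"");  // left double quotation mark
        TABLE.put('\u201D', "\"");  // right double quotation mark
        TABLE.put('\u2013', "-");   // en dash
        TABLE.put('\u2014', "-");   // em dash
        TABLE.put('\u2026', "..."); // horizontal ellipsis
    }

    // Keeps every char Latin-1 can encode, substitutes table entries for the
    // rest, and falls back to "?" when the table has no entry. Characters
    // outside the BMP are two chars in Java, so they become "??".
    static String transliterate(String s) {
        CharsetEncoder latin1 = StandardCharsets.ISO_8859_1.newEncoder();
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (latin1.canEncode(c)) {
                sb.append(c);
            } else {
                String replacement = TABLE.get(c);
                sb.append(replacement != null ? replacement : "?");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(transliterate("Descri\u00e7\u00e3o f\u00edsica\u2026"));
    }
}
```

Compared with chained `replaceAll()` calls, this makes a single pass over each line and keeps the list of replacements in one place.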
