Java CP1252 to UTF8

Question

I have a spreadsheet (.xls) with car plate numbers in windows-1252 encoding, but the numbers were originally entered in Cyrillic, encoded as UTF-8. To illustrate: У992НВ in Cyrillic looks the same as Y992HB in Latin (the first letters differ). So I take those numbers and convert them:

    if (cell.getCellType() == CellType.STRING) {
        String cellValue = cell.getStringCellValue();
        try {
            // Re-encode the text as windows-1252 bytes, then decode those bytes as UTF-8
            byte[] b = cellValue.getBytes("windows-1252");
            String cellValue2 = new String(b, StandardCharsets.UTF_8);
            cell.setCellValue(cellValue2);
        } catch (UnsupportedEncodingException ex) {
            // thrown only if this JVM lacks windows-1252; ignored here
        }
    }

But the output is wrong. The input data in windows-1252 is "Ð¢313ÐÐš777" and the output is Т313�К777: the middle character is unreadable. What am I doing wrong?

What you are doing is wrong: you are converting a string to a byte array encoded with windows-1252 and then you decode it again, pretending that it's UTF-8. Which it isn't, because you've just encoded it as windows-1252... As if you write something down in Russian and then tell someone "Read this, it's in English!". Telling someone that it's in English doesn't magically make it English... — Jesper, Oct 05 '18 at 14:17
Possible duplicate of [Encoding conversion in java](https://stackoverflow.com/questions/229015/encoding-conversion-in-java) — KevinO, Oct 05 '18 at 14:25
The real issue seems to be that the original xls file is wrong - can you go back to whoever created the xls file and ask them to correct it? — PJ Fanning, Oct 05 '18 at 18:29
In addition to @PJ Fanning: "What I mean: i.e. У992НВ in Cyrillic is the same Y992HB in Latin": No, the Cyrillic У(U) as in Утро has absolutely nothing to do with Y, neither has the Cyrillic Н(EN) as in Новости something to do with H nor has the Cyrillic В(VE) as in Вечер something to do with B. — Axel Richter, Oct 07 '18 at 06:02
@PJFanning, nope, because the software was developed in Europe for their specific equipment. — Rostislav Aleev, Oct 08 '18 at 11:29
@AxelRichter, of course, if you look at it that way. The peculiarity of car plates is that they don't use Cyrillic characters that can't be read in countries with a non-Cyrillic alphabet. So У reads as Y. You won't see Cyrillic symbols like Ш or Л. — Rostislav Aleev, Oct 08 '18 at 11:33
Then Y992HB should never be У992НВ. And if it is, then this is wrong and not something that can be corrected by using different encodings. Latin and Cyrillic letters cannot be substituted for each other that simply. Or how would you replace the Latin letters V, W, X or Z with Cyrillic letters? — Axel Richter, Oct 08 '18 at 11:54
Maybe you need a custom transliteration like in https://stackoverflow.com/questions/16273318/transliteration-from-cyrillic-to-latin-icu4j-java — PJ Fanning, Oct 09 '18 at 19:07
I'm still looking for a way to fix my problem. I'm going to try Jython, because there is an ftfy package for fixing broken UTF-8 characters. — Rostislav Aleev, Oct 16 '18 at 09:51

score 0 · Answer 1 · answered Oct 19 '18 at 09:54

Java's `byte` is a signed type, so it does not behave like a raw unsigned byte, and byte-by-byte decoding didn't work for me.
I parsed the characters' numeric values and tried to decode them by matching those values against UTF-8. Some values only matched Latin-1 characters. I found a Python package that decodes broken UTF-8, and it works. But it doesn't work with Jython 2.7, because the maintainer dropped support for Python 2.7.
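
For readers who want to stay in plain Java, here is a minimal sketch of one possible workaround. It rests on an assumption that the thread does not confirm: that the unreadable character in the cell is one of the C1 control characters U+0081, U+008D, U+008F, U+0090 or U+009D, which correspond to the five byte values windows-1252 leaves undefined. The class and method names (`MojibakeFix`, `sloppyCp1252Bytes`, `fix`) are hypothetical. If the spreadsheet already stores `?` or U+FFFD in that position, the original byte is lost and this will not recover it.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: undoes "UTF-8 bytes read as windows-1252" mojibake.
class MojibakeFix {
    private static final Charset CP1252 = Charset.forName("windows-1252");

    // True for the five byte values that windows-1252 leaves undefined.
    private static boolean isUndefinedInCp1252(char c) {
        return c == 0x81 || c == 0x8D || c == 0x8F || c == 0x90 || c == 0x9D;
    }

    // Encodes to windows-1252, but writes the five undefined positions
    // through as raw bytes (assumption: they arrive as C1 control characters).
    static byte[] sloppyCp1252Bytes(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (isUndefinedInCp1252(c)) {
                out.write(c); // low 8 bits equal the original byte value
            } else {
                // Characters outside windows-1252 are encoded as '?' here
                byte[] enc = String.valueOf(c).getBytes(CP1252);
                out.write(enc, 0, enc.length);
            }
        }
        return out.toByteArray();
    }

    // Re-interprets the recovered bytes as UTF-8.
    static String fix(String mojibake) {
        return new String(sloppyCp1252Bytes(mojibake), StandardCharsets.UTF_8);
    }
}
```

In the loop from the question this would be called as `cell.setCellValue(MojibakeFix.fix(cellValue))`. Using `Charset.forName` instead of a charset name string also removes the need to catch `UnsupportedEncodingException`.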

Thanks for your help.
