Issues in Converting values to UTF-8

Question

I am having trouble displaying names correctly in reports. My application uses several technologies: PHP, Perl and, for BI, Pentaho.

We use MySQL as the database, and my table is declared with CHARSET=utf8.

The table currently stores values like the following, which are wrong:

Row1 = Ãxâ€”350
Row2 = Ã‘zâ€“401

PHP and Perl each use their own built-in functions to convert the values stored in the DB, and the UI displays them as below, which is correct:

Expected Row1 = Áx—350
Expected Row2 = Ñz–401

The reports use Pentaho, where an ETL job transforms the data before it is shown. To convert the stored values above, I am using a Java step like this:

new java.lang.String(new java.lang.String(CODE).getBytes("Windows-1252"), "UTF-8")

But this does not convert the values properly. Of the two wrong values above, only Row2 is converted correctly; Row1 comes out wrong, as below:

Converted Row1 = �?x—350
Converted Row2 = Ñz–401

Please suggest how I can convert the values properly, so that, for example, the Row1 value becomes Áx—350.

I wrote a small Java program to convert the string Ãxâ€”350 to Áx—350:

String input = "Ãxâ€”350";
byte[] b1 = input.getBytes("Windows-1252");
System.out.println("Input Get Bytes = "+b1.toString());

String szUT8 = new String(b1, "UTF-8");
System.out.println("Input Encoded = " + szUT8);

The output of the code above is:

Input Get Bytes = [B@157ee3e5
Input Encoded = �?x—350-350—É1

As the output shows, the string is wrong; the expected output is Áx—350.

To double-check the encoding/decoding schemes, I tested the string Ãxâ€”350 with an online tool, and the output was Áx—350, as expected.

So can anyone point out why the Java code does not convert properly even though I am using the right encoding/decoding schemes? Am I missing something, or is my approach wrong?

What exactly is the actual expected value? "Áx—350"? You simply fail to handle UTF-8 correctly all the way through. See http://stackoverflow.com/q/279170/476 and [Handling Unicode Front To Back In A Web App](http://kunststube.net/frontback/) for starters. — deceze, Aug 03 '16 at 10:54
the code you are using to convert is Java, not JavaScript no? — beasy, Aug 03 '16 at 12:43
`java.lang.String` ... That is ***Java***, *not* Javascript! — Sinan Ünür, Aug 03 '16 at 12:44
Please provide the output of the SQL function `HEX()` for those values. — ikegami, Aug 03 '16 at 15:21
@ikegami Thanks, HEX() values for Ãxâ€”350 = C383C28178C3A2E282ACE2809D333530 and for Ã‘zâ€“401 = C383E280987AC3A2E282ACE2809C343031 — Yog, Aug 04 '16 at 08:08

beasy · Answer 1 · 2016-08-04T11:46:52.600

0

The CHARSET setting in your db being set to utf-8 doesn't necessarily mean that the data there is properly encoded in utf-8 (or even in utf-8 at all), as we can see. It looks like you are dealing with mojibake: characters that were at one time decoded using the wrong encoding scheme and then, in turn, encoded wrongly. Fixing that is usually a tedious process of figuring out past decode/encode errors and then undoing them.

Long story short: if you have mojibake, there is no automatic conversion you can apply unless you know (or can figure out) what conversions were made in the past.
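
One way to figure out what happened is to look at the exact characters that are stored. The following is a minimal sketch in Java (the language used in the question), fed with the HEX() output for Row1 reported in the comments. The class name `InspectStored` and the helpers `fromHex` and `codePoints` are made up for this illustration. It assumes the hex string really is the UTF-8 content of the column.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class InspectStored {
    // Turn a hex string such as "C383" into the bytes it denotes.
    static byte[] fromHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    // List the code points of a string as "U+00C3 U+0081 ...".
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // HEX() output for Row1, as reported in the comments.
        String hex = "C383C28178C3A2E282ACE2809D333530";
        String stored = new String(fromHex(hex), StandardCharsets.UTF_8);
        System.out.println(codePoints(stored));

        // Ask Java's windows-1252 encoder whether it can represent U+0081.
        Charset cp1252 = Charset.forName("windows-1252");
        System.out.println(cp1252.newEncoder().canEncode('\u0081'));
    }
}
```

The first line of output lists U+0081 as the second character: an invisible control character sitting between "Ã" and "x". Byte 0x81 has no assigned character in windows-1252, so the second line is expected to be `false`, which would explain the "?" in the converted Row1. Row2 has U+2018 in that position, which windows-1252 does define (as byte 0x91), so it survives the round trip.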

Converting is a matter of first decoding, then encoding. To convert in Perl:

use Encode;

my $string = "some windows-1252 string";      # bytes in windows-1252
my $raw = decode('windows-1252', $string);    # bytes -> Perl character string
my $encoded = encode('utf-8', $raw);          # characters -> UTF-8 bytes
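
For the Java step in the question, a hedged sketch of undoing this particular mojibake could look like the following. It assumes exactly one round of "UTF-8 bytes misread as windows-1252", and that bytes without a windows-1252 character (0x81, 0x8D, 0x8F, 0x90, 0x9D) were stored as the control characters U+0081 and so on, which is what the HEX() output for Row1 suggests. The class name `MojibakeRepair` and the method `repair` are hypothetical, not part of Pentaho or the JDK.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeRepair {
    private static final Charset CP1252 = Charset.forName("windows-1252");

    // Reverse one round of "UTF-8 bytes misread as windows-1252".
    static String repair(String garbled) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        garbled.codePoints().forEach(cp -> {
            if (cp >= 0x80 && cp <= 0x9F) {
                // Assumption: a C1 control character stands for the raw
                // byte of the same value (e.g. U+0081 -> byte 0x81).
                bytes.write(cp);
            } else {
                // Everything else goes through the normal windows-1252 encoder.
                byte[] b = new String(Character.toChars(cp)).getBytes(CP1252);
                bytes.write(b, 0, b.length);
            }
        });
        // The recovered bytes are the original UTF-8 sequence.
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Stored values written with escapes (taken from the HEX() output),
        // so the encoding of this source file does not matter.
        String row1 = "\u00C3\u0081x\u00E2\u20AC\u201D350";
        String row2 = "\u00C3\u2018z\u00E2\u20AC\u201C401";
        System.out.println(repair(row1));
        System.out.println(repair(row2));
    }
}
```

What the console shows depends on its own encoding, so comparing the result against the expected string in code is more reliable than reading the printed text. Characters that windows-1252 cannot represent at all would still come out as "?", so this only repairs data damaged in exactly the way described.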

edited Aug 04 '16 at 11:46

answered Aug 03 '16 at 13:07

beasy


Thanks. I checked the encoding and decoding schemes with this [link](http://string-functions.com/encodedecode.aspx): entering the string **Ãxâ€”350**, encoding with **Windows-1252** and decoding with **utf-8** gives the correct result, **Áx—350**. So I don't understand why `new java.lang.String(new java.lang.String(CODE).getBytes("Windows-1252"), "UTF-8")`, which encodes and decodes with the same schemes, does not give the desired result. Any suggestions? – Yog Aug 04 '16 at 08:21
I don't know Java, but I'm pretty sure your Java command encodes the string twice. It never decodes. I'm editing my answer to show how to decode and encode in Perl. – beasy Aug 04 '16 at 11:40
