Java encoding - corrupted French characters

Question

I have a system that receives French text from a third party, but I am having a hard time making it readable.

String frenchReceipt = "RETIRï¿½E"; // The original Text should be "RETIRÉE"

I tried all possible combinations of converting the string with the UTF-8 and ISO-8859-1 encodings:

String frenchReceipt = "RETIRï¿½E"; // The original Text should be "RETIRÉE"

byte[] b1 = new String(frenchReceipt.getBytes()).getBytes("UTF-8"); 
System.out.println(new String(b1));  // RETIRÃ¯Â¿Â½E

byte[] b2 = new String(frenchReceipt.getBytes()).getBytes("ISO-8859-1"); 
System.out.println(new String(b2));  // RETIRï¿½E

byte[] b3 = new String(frenchReceipt.getBytes(), "UTF-8").getBytes(); 
System.out.println(new String(b3));  // RETIR?E 

byte[] b4 = new String(frenchReceipt.getBytes(), "UTF-8").getBytes(); 
System.out.println(new String(b4));  //RETIR?E

byte[] b5 = new String(frenchReceipt.getBytes(), "ISO-8859-1").getBytes("UTF-8"); 
System.out.println(new String(b5));  //RETIRÃ¯Â¿Â½E

byte[] b6 = new String(frenchReceipt.getBytes(), "UTF-8").getBytes("ISO-8859-1"); 
System.out.println(new String(b6));  //RETIR?E

byte[] b7 = new String(frenchReceipt.getBytes(), "UTF-8").getBytes("UTF-8"); 
System.out.println(new String(b7));  //RETIRï¿½E

byte[] b8 = new String(frenchReceipt.getBytes(), "ISO-8859-1").getBytes("ISO-8859-1"); 
System.out.println(new String(b8));  //RETIRï¿½E

As you can see, nothing fixes the problem.

Please advise.

Update: The third-party partner confirmed that the data sent to my application is encoded in "ISO-8859-1".

See https://stackoverflow.com/questions/6543548/whats-going-on-with-this-byte-array. The characters ï¿½ correspond to the bytes EF BF BD, as mentioned in the answer there. — mayamar, Mar 30 '21 at 18:09
@mayamar The default text file encoding is "Cp1252". I also tried changing it to "UTF-8" and "ISO-8859-1", but that didn't fix the issue. — R.Almoued, Mar 30 '21 at 19:10

Oleks · Accepted Answer · 2021-03-31T16:12:30.497


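ï¿½ is what the UTF-8 encoding of the Unicode replacement character U+FFFD (bytes EF|BF|BD) looks like when it is displayed as ISO-8859-1 or Cp1252. A decoder emits the replacement character when it cannot map a byte sequence to a valid character, so the original byte is already lost at that point and there is no way to convert ï¿½ back into É.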

frenchReceipt doesn't contain any byte sequence which could be converted into É because of the declaration:

String frenchReceipt = "RETIRï¿½E";
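To illustrate why the information is gone, here is a minimal sketch of one plausible way such corruption arises. This is an assumption, since the receiving code is not shown in the question: ISO-8859-1 bytes are decoded as UTF-8, the byte C9 is invalid there and becomes U+FFFD, and the UTF-8 bytes of U+FFFD are later displayed as ISO-8859-1. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class ReplacementCharDemo {

    // Hypothetical helper: decodes bytes with the wrong charset (UTF-8 instead of ISO-8859-1).
    static String decodeAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Hypothetical helper: shows how the UTF-8 bytes of a string look when viewed as ISO-8859-1.
    static String viewUtf8BytesAsLatin1(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "RETIRÉE" as ISO-8859-1 bytes: 52 45 54 49 52 C9 45
        byte[] wire = "RETIR\u00C9E".getBytes(StandardCharsets.ISO_8859_1);

        // C9 followed by 45 is not valid UTF-8, so the decoder substitutes U+FFFD.
        String broken = decodeAsUtf8(wire);
        System.out.println(broken.indexOf('\uFFFD')); // 5

        // U+FFFD is EF BF BD in UTF-8; viewed as ISO-8859-1 this is the garbage from the question.
        System.out.println(viewUtf8BytesAsLatin1(broken));

        // Re-encoding cannot bring C9 back: U+FFFD has no ISO-8859-1 mapping and becomes '?'.
        byte[] back = broken.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(new String(back, StandardCharsets.ISO_8859_1)); // RETIR?E
    }
}
```

Once the string holds U+FFFD, every re-encoding attempt in the question starts from data that no longer contains the original byte.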

Your code snippet below should work fine, but you have to use the correct byte source.

byte[] b2 = new String(frenchReceipt.getBytes()).getBytes("ISO-8859-1");
System.out.println(new String(b2));

So if you read "RETIRÉE" as bytes from the data source and get 52|45|54|49|52|C9|45 (as expected for ISO-8859-1), then decoding them gives the proper result. If the data source already contains the byte sequence EF|BF|BD, the only option you have is search and replace, but in that case there is no way to tell e.g. ä from É.
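As a minimal sketch of both cases, assuming the raw bytes are available before any String is created: decoding 52|45|54|49|52|C9|45 as ISO-8859-1 yields the expected text, while a lossy UTF-8 decode turns both É (C9) and ä (E4) into the same replacement character. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class Latin1DecodeDemo {

    // Decodes raw bytes with the charset the sender actually used.
    static String decodeLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Decodes raw bytes as UTF-8; invalid sequences are replaced with U+FFFD.
    static String decodeUtf8Lossy(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = {0x52, 0x45, 0x54, 0x49, 0x52, (byte) 0xC9, 0x45};

        // ISO-8859-1 maps every byte to the code point with the same value, so C9 becomes U+00C9 (É).
        String text = decodeLatin1(raw);
        System.out.println(text.equals("RETIR\u00C9E")); // true

        // After a lossy decode, É and ä are indistinguishable.
        String fromE = decodeUtf8Lossy(new byte[] {(byte) 0xC9, 0x45});
        String fromA = decodeUtf8Lossy(new byte[] {(byte) 0xE4, 0x45});
        System.out.println(fromE.equals(fromA)); // true
    }
}
```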

Update: Since the data is delivered over TCP, reading the stream with an explicit charset

new BufferedReader(new InputStreamReader(connection.getInputStream(),"ISO-8859-1"))

solved the issue.
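A slightly fuller sketch of the same idea, under stated assumptions: FS is taken to be ASCII 0x1C and EOT ASCII 0x04 (the question names the separators but not their byte values), a ByteArrayInputStream stands in for connection.getInputStream(), and the class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class Latin1RecordReader {

    static final char FS = 0x1C;  // assumed byte value of the field separator
    static final char EOT = 0x04; // assumed byte value of the end-of-record marker

    // Reads characters up to EOT (or end of stream) and splits them on FS.
    static List<String> readRecord(Reader reader) throws IOException {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            if (c == EOT) {
                fields.add(current.toString());
                return fields;
            }
            if (c == FS) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append((char) c);
            }
        }
        // The stream ended without EOT: return whatever was collected.
        if (current.length() > 0) {
            fields.add(current.toString());
        }
        return fields;
    }

    public static void main(String[] args) throws IOException {
        // Simulated wire data: 00[FS]RETIRÉE[FS]FR[EOT] encoded as ISO-8859-1.
        byte[] wire = ("00" + FS + "RETIR\u00C9E" + FS + "FR" + EOT)
                .getBytes(StandardCharsets.ISO_8859_1);
        InputStream in = new ByteArrayInputStream(wire);

        // The charset is given explicitly, so the platform default is never involved.
        Reader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.ISO_8859_1));
        List<String> fields = readRecord(reader);
        System.out.println(fields.size());                        // 3
        System.out.println(fields.get(1).equals("RETIR\u00C9E")); // true
    }
}
```

On a persistent socket it is probably safer to create the reader once per connection and reuse it for every record, because InputStreamReader and BufferedReader read ahead and a discarded reader can take already-buffered bytes with it.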

edited Mar 31 '21 at 16:12

answered Mar 30 '21 at 23:44

Oleks


Can you tell me exactly what I should do? I didn't understand the last part of your answer. How do I read "RETIRÉE" as bytes from the data source and get 52|45|54|49|52|C9|45? – R.Almoued Mar 31 '21 at 00:37
The answer to the question "how to read" depends on the data source (XML, database, binary stream, etc.). What is the real data source of "RETIRï¿½E"? – Oleks Mar 31 '21 at 09:02
According to the document it is "plain text" encoded as "ISO-8859-1" – R.Almoued Mar 31 '21 at 10:53
To make it clear: the response I get is a series of parameters separated by a Field Separator (FS) and ending with an (EOT) field. For example: 00[FS]RETIRï¿½E[FS]FR[EOT]. Of course there are no brackets [ ] around FS and EOT; I just added them for readability. – R.Almoued Mar 31 '21 at 11:00
I asked about the data source to find out why some characters are lost. Is it a rendering problem, or is the data source already corrupted? So the question is still open for me. If it were "plain text" encoded as "ISO-8859-1", you would solve the issue easily. – Oleks Mar 31 '21 at 12:50
As far as I can guess, you are working with an HTTP response. The best option here is to encode the record 00[FS]RETIRÉE[FS]FR[EOT] with Base64 on the server side and decode it on the client. In that case you will have no trouble with "unrecognized" characters. https://stackoverflow.com/questions/3538021/why-do-we-use-base64 Also, if you are dealing with HTTP, try asking the server to encode the response: headers.put("Accept-Encoding", "UTF-8"); – Oleks Mar 31 '21 at 12:50
If it is just a file, please open the record in any hex editor and make sure that the sequence EF|BF|BD is not present there. It is not easy to help without a real data snippet. – Oleks Mar 31 '21 at 12:51
It is not HTTP. I connect to the server over TCP, using a persistent socket for every request and response, and in the response I get a string parameter like the sample I sent before: 00[FS]RETIRï¿½E[FS]FR[EOT] – R.Almoued Mar 31 '21 at 13:02

A TCP connection has no concept of character encoding. Try reading the stream with BufferedReader br = new BufferedReader(new InputStreamReader(connection.getInputStream(),"ISO-8859-1")); – Oleks Mar 31 '21 at 13:13
Thanks a lot. That is exactly what I was asking for. It fixed the issue. – R.Almoued Mar 31 '21 at 13:55
