Weird Java UTF-8 encoding behaviour: new String(bytes, "UTF-8") gives different results on mostly similar setups

Question

My application needs to support Japanese characters, so we use UTF-8 as the default encoding across the entire stack. We are facing a weird issue where new String(bytes, "UTF-8") gives different results on two systems.

Input from user: 東京
Base64 encoded string generated from Browser and sent to API: 5p2x5Lqs
Both systems generate the same byte array.
But only on System 1 does the decoded string come out as 東京
On System 2 the decoded string comes out as ??

System 1:
Container: Tomee 7.1.0
JDK: 1.8.0_201-b09
OS Version: 3.10.0-957.12.2.el7.x86_64
Architecture: amd64
Locale:

[logs]$ locale
LANG=en_GB.UTF-8
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_PAPER="en_GB.UTF-8"
LC_NAME="en_GB.UTF-8"
LC_ADDRESS="en_GB.UTF-8"
LC_TELEPHONE="en_GB.UTF-8"
LC_MEASUREMENT="en_GB.UTF-8"
LC_IDENTIFICATION="en_GB.UTF-8"
LC_ALL=
[logs]$ locale status
locale: unknown name "status"

System 2:
Container: Tomee 7.1.0
JDK: 1.8.0_201-b09
OS Version: 2.6.32-696.18.7.el6.x86_64
Architecture: amd64
Locale:

[ logs]$ locale
LANG=en_GB
LC_CTYPE="en_GB"
LC_NUMERIC="en_GB"
LC_TIME="en_GB"
LC_COLLATE="en_GB"
LC_MONETARY="en_GB"
LC_MESSAGES="en_GB"
LC_PAPER="en_GB"
LC_NAME="en_GB"
LC_ADDRESS="en_GB"
LC_TELEPHONE="en_GB"
LC_MEASUREMENT="en_GB"
LC_IDENTIFICATION="en_GB"
LC_ALL=
[ logs]$ locale status
locale: unknown name "status"

Java code being used

LogUtil.logMessage("searchString before decoding=" + searchString);
// Earlier attempt: s = new String(Base64.decodeBase64(searchString), utf8_test);
byte[] decodedBytes = Base64.getDecoder().decode(searchString);
byte[] decodedBytesFromApache = org.apache.commons.codec.binary.Base64.decodeBase64(searchString);
System.out.println("java native array :: ");
for (byte b : decodedBytes) {
    System.out.print(b);
}
System.out.println("\njava apache array :: \n");
for (byte b : decodedBytesFromApache) {
    System.out.print(b);
}
s = new String(decodedBytes, "UTF-8"); // Charset.forName("UTF-8") was also tried here
System.out.println("\n String post decode:: " + s);
System.out.println("");
System.out.println("loaded charset is utf-8:: " + Charset.isSupported("UTF-8"));
Set<String> listOfCharsets = Charset.availableCharsets().keySet();
System.out.println("Listing supported charsets:: ");
for (String item : listOfCharsets) {
    System.out.println(item);
}

Output on System1

searchString before decoding=5p2x5Lqs
java native array ::
-26-99-79-28-70-84
java apache array ::

-26-99-79-28-70-84
 String post decode:: 東京

loaded charset is utf-8:: true
Listing supported charsets::
Big5
Big5-HKSCS
CESU-8
EUC-JP
EUC-KR
GB18030
GB2312
GBK
IBM-Thai
IBM00858
IBM01140
IBM01141
IBM01142
IBM01143
IBM01144
IBM01145
IBM01146
IBM01147
IBM01148
IBM01149
IBM037
IBM1026
IBM1047
IBM273
IBM277
IBM278
IBM280
IBM284
IBM285
IBM290
IBM297
IBM420
IBM424
IBM437
IBM500
IBM775
IBM850
IBM852
IBM855
IBM857
IBM860
IBM861
IBM862
IBM863
IBM864
IBM865
IBM866
IBM868
IBM869
IBM870
IBM871
IBM918
ISO-2022-CN
ISO-2022-JP
ISO-2022-JP-2
ISO-2022-KR
ISO-8859-1
ISO-8859-13
ISO-8859-15
ISO-8859-2
ISO-8859-3
ISO-8859-4
ISO-8859-5
ISO-8859-6
ISO-8859-7
ISO-8859-8
ISO-8859-9
JIS_X0201
JIS_X0212-1990
KOI8-R
KOI8-U
Shift_JIS
TIS-620
US-ASCII
UTF-16
UTF-16BE
UTF-16LE
UTF-32
UTF-32BE
UTF-32LE
UTF-8
windows-1250
windows-1251
windows-1252
windows-1253
windows-1254
windows-1255
windows-1256
windows-1257
windows-1258
windows-31j
x-Big5-HKSCS-2001
x-Big5-Solaris
x-COMPOUND_TEXT
x-euc-jp-linux
x-EUC-TW
x-eucJP-Open
x-IBM1006
x-IBM1025
x-IBM1046
x-IBM1097
x-IBM1098
x-IBM1112
x-IBM1122
x-IBM1123
x-IBM1124
x-IBM1166
x-IBM1364
x-IBM1381
x-IBM1383
x-IBM300
x-IBM33722
x-IBM737
x-IBM833
x-IBM834
x-IBM856
x-IBM874
x-IBM875
x-IBM921
x-IBM922
x-IBM930
x-IBM933
x-IBM935
x-IBM937
x-IBM939
x-IBM942
x-IBM942C
x-IBM943
x-IBM943C
x-IBM948
x-IBM949
x-IBM949C
x-IBM950
x-IBM964
x-IBM970
x-ISCII91
x-ISO-2022-CN-CNS
x-ISO-2022-CN-GB
x-iso-8859-11
x-JIS0208
x-JISAutoDetect
x-Johab
x-MacArabic
x-MacCentralEurope
x-MacCroatian
x-MacCyrillic
x-MacDingbat
x-MacGreek
x-MacHebrew
x-MacIceland
x-MacRoman
x-MacRomania
x-MacSymbol
x-MacThai
x-MacTurkish
x-MacUkraine
x-MS932_0213
x-MS950-HKSCS
x-MS950-HKSCS-XP
x-mswin-936
x-PCK
x-SJIS_0213
x-UTF-16LE-BOM
X-UTF-32BE-BOM
X-UTF-32LE-BOM
x-windows-50220
x-windows-50221
x-windows-874
x-windows-949
x-windows-950
x-windows-iso2022jp
searchString after decoding=東京

Output on System 2

searchString before decoding=5p2x5Lqs
java native array ::
-26-99-79-28-70-84
java apache array ::

-26-99-79-28-70-84
 String post decode:: ??

loaded charset is utf-8:: true
Listing supported charsets::
Big5
Big5-HKSCS
CESU-8
EUC-JP
EUC-KR
GB18030
GB2312
GBK
IBM-Thai
IBM00858
IBM01140
IBM01141
IBM01142
IBM01143
IBM01144
IBM01145
IBM01146
IBM01147
IBM01148
IBM01149
IBM037
IBM1026
IBM1047
IBM273
IBM277
IBM278
IBM280
IBM284
IBM285
IBM290
IBM297
IBM420
IBM424
IBM437
IBM500
IBM775
IBM850
IBM852
IBM855
IBM857
IBM860
IBM861
IBM862
IBM863
IBM864
IBM865
IBM866
IBM868
IBM869
IBM870
IBM871
IBM918
ISO-2022-CN
ISO-2022-JP
ISO-2022-JP-2
ISO-2022-KR
ISO-8859-1
ISO-8859-13
ISO-8859-15
ISO-8859-2
ISO-8859-3
ISO-8859-4
ISO-8859-5
ISO-8859-6
ISO-8859-7
ISO-8859-8
ISO-8859-9
JIS_X0201
JIS_X0212-1990
KOI8-R
KOI8-U
Shift_JIS
TIS-620
US-ASCII
UTF-16
UTF-16BE
UTF-16LE
UTF-32
UTF-32BE
UTF-32LE
UTF-8
windows-1250
windows-1251
windows-1252
windows-1253
windows-1254
windows-1255
windows-1256
windows-1257
windows-1258
windows-31j
x-Big5-HKSCS-2001
x-Big5-Solaris
x-COMPOUND_TEXT
x-euc-jp-linux
x-EUC-TW
x-eucJP-Open
x-IBM1006
x-IBM1025
x-IBM1046
x-IBM1097
x-IBM1098
x-IBM1112
x-IBM1122
x-IBM1123
x-IBM1124
x-IBM1166
x-IBM1364
x-IBM1381
x-IBM1383
x-IBM300
x-IBM33722
x-IBM737
x-IBM833
x-IBM834
x-IBM856
x-IBM874
x-IBM875
x-IBM921
x-IBM922
x-IBM930
x-IBM933
x-IBM935
x-IBM937
x-IBM939
x-IBM942
x-IBM942C
x-IBM943
x-IBM943C
x-IBM948
x-IBM949
x-IBM949C
x-IBM950
x-IBM964
x-IBM970
x-ISCII91
x-ISO-2022-CN-CNS
x-ISO-2022-CN-GB
x-iso-8859-11
x-JIS0208
x-JISAutoDetect
x-Johab
x-MacArabic
x-MacCentralEurope
x-MacCroatian
x-MacCyrillic
x-MacDingbat
x-MacGreek
x-MacHebrew
x-MacIceland
x-MacRoman
x-MacRomania
x-MacSymbol
x-MacThai
x-MacTurkish
x-MacUkraine
x-MS932_0213
x-MS950-HKSCS
x-MS950-HKSCS-XP
x-mswin-936
x-PCK
x-SJIS_0213
x-UTF-16LE-BOM
X-UTF-32BE-BOM
X-UTF-32LE-BOM
x-windows-50220
x-windows-50221
x-windows-874
x-windows-949
x-windows-950
x-windows-iso2022jp
searchString after decoding=??

The ?? is not caused by the terminal window: both outputs were captured from the same PuTTY terminal with identical settings. The ?? is then passed to jdbctemplate, which returns 0 results on System 2, while on System 1 we get the expected result. How can we make the decoding consistent on all systems?

missing the important information: what is really in the string: `System.out.println(Arrays.toString(s.toCharArray()))` or similar ((BTW `println("")` is same as `println()`)) — user85421, Jan 28 '20 at 06:51
@user85421 the text is Japanese characters. The sample used in the example is 東京. I have updated the information — Harshit, Jan 28 '20 at 07:22
I guess you mention putty because you log in remotely into systems 1 and 2, they are likely Linux systems and you probably run the Java application from a shell. If so, likely the locale settings of the shells are different. Type `locale` or `locale status` on the command line... — Codo, Jan 28 '20 at 07:43
@Codo updated the question with locale output. locale status give me an error. — Harshit, Jan 28 '20 at 07:54
So `locale` output is different and explains the behavior. The Terminal does not know about utf-8 on System2 — Christoph Bauer, Jan 28 '20 at 08:00
@ChristophBauer but why would the SQL query fail? The terminal may not know the encoding, but the value is sent to jdbctemplate within Java. Also, when selecting the records from the query, the same text is displayed just fine in the same session — Harshit, Jan 28 '20 at 08:06
The problem is `System.out`. It chooses an encoding depending on the locale environment variables. On System 2, the variable values are missing information about the encoding, so Java likely picks the wrong one. — Codo, Jan 28 '20 at 11:30
please re-read my comment - I have not mentioned/asked what characters there are - I was interested in the numeric content of the strings on each system (numbers are more likely displayable independent from OS, the only apparent difference) - I do not think that decoding is the problem (in my comment I already added an example code....) — user85421, Jan 28 '20 at 15:59

score 1 · Answer 1 · answered Feb 25 '20 at 08:31

As suggested in one of the comments, your problem is probably due to your use of `System.out`. `System.out` is a field (not a method) holding a PrintStream, which may be using the default encoding of the JVM, which may or may not be UTF-8. See the OpenJDK issue JDK-8187041, Use UTF-8 as default Charset (unresolved at the time of writing), for more information on that. The summary of that issue states (with my emphasis):

Use UTF-8 as the Java virtual machine's default charset so that APIs that depend on the default charset behave consistently across all platforms.

Also see the SO question Default character encoding for java console output.

Also note that the locale data for your two systems is different. For example: LANG=en_GB.UTF-8 on the system where the Japanese characters rendered correctly, compared to LANG=en_GB on the system where the Japanese characters did not render correctly.
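
To confirm which charset each JVM actually picked up from its environment, a small diagnostic along the following lines could be run on both systems. This is only a sketch: the class name `DefaultCharsetReport` and the method `describeDefaults` are made up for illustration, and `sun.jnu.encoding` is a JDK-internal property that may be absent on some runtimes. On a locale without a codeset suffix such as `en_GB`, a JDK 8 JVM on Linux typically reports a non-UTF-8 default (often ISO-8859-1), but that is an assumption to be checked rather than a guarantee.

```java
import java.nio.charset.Charset;

public class DefaultCharsetReport {

    /** Builds a one-line summary of the encodings the running JVM has chosen. */
    static String describeDefaults() {
        return "defaultCharset=" + Charset.defaultCharset().name()
                + ", file.encoding=" + System.getProperty("file.encoding")
                // JDK-internal property; may be null on some runtimes.
                + ", sun.jnu.encoding=" + System.getProperty("sun.jnu.encoding");
    }

    public static void main(String[] args) {
        // The report contains only ASCII, so it prints legibly whatever the console encoding is.
        System.out.println(describeDefaults());
    }
}
```

If the two systems report different values, starting the JVM with `-Dfile.encoding=UTF-8` or exporting `LANG=en_GB.UTF-8` before launching the container are commonly used ways to align them; whether either suits this deployment is something to verify locally.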

To avoid the potential issue of the JVM on one system not using UTF-8 encoding by default, simply create your own PrintStream for output which explicitly uses UTF-8:

import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

...

    // Write the output to a UTF-8 PrintStream.
    // Note: this constructor declares the checked UnsupportedEncodingException.
    PrintStream ps = new PrintStream(System.out, true, StandardCharsets.UTF_8.name());
    ps.println("java native array :: ");
    // etc...
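
The effect of the PrintStream's charset can be seen without any terminal by writing into a byte buffer instead of the console. The sketch below is illustrative only: `PrintStreamEncodingDemo` and `printWith` are hypothetical names, and ISO-8859-1 is used as a stand-in for whatever non-UTF-8 default System 2 may have selected.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.util.Arrays;

public class PrintStreamEncodingDemo {

    /** Prints text through a PrintStream using the named charset and returns the raw bytes written. */
    static byte[] printWith(String text, String charsetName) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            PrintStream ps = new PrintStream(buffer, true, charsetName);
            ps.print(text);
            ps.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // 東京 written as Unicode escapes so the source file encoding does not matter.
        String tokyo = "\u6771\u4eac";

        // ISO-8859-1 cannot represent these characters; each becomes '?' (byte 63).
        System.out.println(Arrays.toString(printWith(tokyo, "ISO-8859-1"))); // [63, 63]

        // UTF-8 reproduces the byte array shown in the question.
        System.out.println(Arrays.toString(printWith(tokyo, "UTF-8"))); // [-26, -99, -79, -28, -70, -84]
    }
}
```

The two `?` bytes match the `??` seen on System 2, which is consistent with the String itself being intact and only the output step losing the characters.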

Notes:

Creating a UTF-8 String is fine, but that in itself provides no guarantee that it will render correctly.
One of the statements your code logged was `loaded charset is utf-8:: true`, but it only logged true because `Charset.isSupported("UTF-8")` returned true. Supporting a specific charset says nothing about whether it is being used (or "loaded", to borrow your term). As your output showed, you had dozens of supported charsets. The crucial point is to actually use UTF-8 for rendering the Japanese characters.
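
One way to separate decoding from rendering, along the lines of the first comment on the question, is to print the decoded String's code points as ASCII text, which displays the same on any console. This is a minimal sketch; `DecodeCheck` and `codePointsOf` are names invented for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class DecodeCheck {

    /** Decodes Base64 text to a String via UTF-8 and lists its code points as U+XXXX. */
    static String codePointsOf(String base64) {
        byte[] bytes = Base64.getDecoder().decode(base64);
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        decoded.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // The Base64 value from the question; the output is pure ASCII.
        System.out.println(codePointsOf("5p2x5Lqs")); // prints "U+6771 U+4EAC"
    }
}
```

If both systems print `U+6771 U+4EAC`, the String was decoded identically on both, and any remaining difference lies in how it is written out or passed on afterwards.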

If changing the println() calls doesn't resolve your issue please update your question accordingly.
