String UTF8 encoding issue

Question

The following simple test is failing:

assertEquals(myStringComingFromTheDB, "£");

Giving:

Expected :£
Actual   :Â£

I don't understand why this is happening, especially since it is the actual string (the literal passed as the second argument) whose encoding turns out to be wrong. The Java file is saved as UTF-8.

The following code:

System.out.println(bytesToHex(myStringComingFromTheDB.getBytes()));
System.out.println(bytesToHex("£".getBytes()));

Outputs:

C2A3
C382C2A3

Can anyone explain to me why?

Thank you.

Update: I'm working under Windows 7.

Update 2: It's not related to JUnit, the following simple example:

byte[] bytes = "£".getBytes();
for(byte b : bytes)
{
    System.out.println(Integer.toHexString(b));
}

Outputs:

ffffffc3
ffffff82
ffffffc2
ffffffa3

Update 3: I'm working in IntelliJ IDEA. I already checked the options and the encoding is UTF-8. It's also shown in the bottom bar, and when I select and right-click the pound sign it says "Encoding (auto-detected): UTF-8".

Update 4: I opened the Java file with a hex editor and the pound sign is saved, correctly, as "C2A3".

score 3 · Accepted Answer · edited May 23 '17 at 11:56

Please note that assertEquals accepts parameters in the following order:

assertEquals(expected, actual)

So in your case the string coming from the DB is fine, but the one from your Java class is not (as you already noticed). My guess is that you copied £ from somewhere, probably along with some invisible characters around it that your editor (IDE) does not display. I have had similar issues a couple of times, especially when working on MS Windows, e.g. Ctrl+C and Ctrl+V from a website into the IDE.

(I printed the bytes of £ on my system with UTF-8 encoding, and the result is C2A3):

for (byte b: "£".getBytes()) {
  System.out.println(Integer.toHexString(b));
}
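
One caveat about the loop above: Java bytes are signed, so Integer.toHexString(b) sign-extends any byte at or above 0x80, which is why the question's output shows ffffffc3 rather than c3. Below is a minimal sketch that masks each byte and names the charset explicitly instead of relying on the platform default. The class name HexDump and the helper toHex are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

final class HexDump {
    // Formats each byte as two uppercase hex digits. The & 0xFF mask
    // removes the sign extension that turns 0xC3 into ffffffc3.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // \u00A3 is the pound sign, written as an escape so the result
        // does not depend on how the compiler decodes the source file.
        System.out.println(toHex("\u00A3".getBytes(StandardCharsets.UTF_8))); // C2A3
    }
}
```

Writing the character as a Unicode escape is also a quick way to tell a source-decoding problem apart from a runtime encoding problem.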

Another possibility is that your file is not really UTF-8 encoded. Do you work on Windows or some other OS?

Some other possible causes, based on the question edits:

1) It's possible that the IDE uses some other encoding. For Eclipse see this thread: http://www.eclipse.org/forums/index.php?t=msg&goto=543800&

2) If both the IDE settings and the final file encoding are OK, then it's a compiler issue. See: Java compiler platform file encoding problem (http://stackoverflow.com/questions/4927575/java-compiler-platform-file-encoding-problem)
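
The doubled byte sequence in the question is consistent with the compiler explanation. Here is a sketch of the mechanism, assuming the compiler decoded the UTF-8 source with a single-byte Latin-1-style default. On Windows 7 that would typically be windows-1252; ISO-8859-1 is used here because it is always available in Java and maps the bytes C2 and A3 to the same two characters. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    // Simulates a compiler reading UTF-8 source bytes with the wrong charset.
    static String misdecode(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    // Two uppercase hex digits per byte, masked to avoid sign extension.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The pound sign exactly as the hex editor shows it in the .java file.
        byte[] onDisk = {(byte) 0xC2, (byte) 0xA3};
        // What the string literal becomes after the wrong decoding step.
        String literal = misdecode(onDisk);
        System.out.println(literal.length());                                // 2
        System.out.println(toHex(literal.getBytes(StandardCharsets.UTF_8))); // C382C2A3
    }
}
```

The literal ends up as two characters (Â followed by £), which matches the "Actual" value in the failing test. Telling the compiler the source encoding, for example with javac's `-encoding UTF-8` option, removes the wrong decoding step.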

edited May 23 '17 at 11:56 by Community
answered Feb 25 '12 at 15:37 by omnomnom

Thank you for your answer, @PiotrekDe. I thought the same about the ctrl+c & ctrl+v, but I input it manually with the keyboard and I still face the problem. I'm using Windows 7. This problem is reeeally bizarre and it's freaking me out! – satoshi Feb 25 '12 at 15:43
So maybe your file is not really UTF-8 encoded? Do you work with some IDE? If it's Eclipse, you can set the default encoding for new files from Window > Preferences > General > Content Types. See this thread: http://www.eclipse.org/forums/index.php?t=msg&goto=543800& – omnomnom Feb 25 '12 at 15:47
I'm working in IntelliJ Idea, the encoding is UTF8. I already checked the options and it's UTF8. Also, it's written in the bottom bar and when I select the sterling pound sign it says "Encoding (auto-detected): UTF-8". – satoshi Feb 25 '12 at 15:51
So it's really weird ;) I'd try to copy-paste '£' from somewhere now, maybe it's some strange Windows keyboard layout issue(?). Or you can try writing this code in some other editor (standard Windows Notepad) and compiling it with javac by hand, just to rule out the IDE or the compiler the IDE uses. – omnomnom Feb 25 '12 at 15:56
See the update 4 in my original question, the pound sign is correctly saved as UTF8 in the java file. :( Thanks for all the help you're giving me, @PiotrekDe! :) – satoshi Feb 25 '12 at 15:59
So if it's not the keyboard, the IDE, or the file, the last suspect is the compiler :) : http://stackoverflow.com/questions/4927575/java-compiler-platform-file-encoding-problem – omnomnom Feb 25 '12 at 16:01
You were right, now it's working perfectly! I checked the compiler and for some reason it was using `aspectjtools-1.6.10.jar` instead of `javac`. I also added the argument `-encoding UTF-8`. If you change your answer or add a new one I will +1 and accept it :) Thanks! – satoshi Feb 25 '12 at 16:15
If UTF-8 encoding is part of the file format's specification, it would be safer to use myStringComingFromTheDB.getBytes("UTF-8"). The parameterless String.getBytes() uses the platform encoding, so results may differ between machines and you run the risk of tricky errors (for example, everything may work fine on the development machine but stop working in production, where the machine happens to use a different locale). The same goes for the opposite operation, i.e. creating a String from a byte array. – Michał Kosmulski Feb 25 '12 at 16:35
@satoshi: glad to hear it ;) Answer has been edited. (Please also take a look at Michał's suggestion above) – omnomnom Feb 25 '12 at 16:42
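
Michał Kosmulski's point about explicit charsets could look like the sketch below. It uses the StandardCharsets.UTF_8 constant (available since Java 7) rather than the "UTF-8" string name from the comment, which avoids the checked UnsupportedEncodingException. The class name, method names, and sample value are illustrative only.

```java
import java.nio.charset.StandardCharsets;

final class ExplicitCharset {
    // Encode with a named charset so the result is the same on every machine.
    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Decode with the same charset; new String(bytes) alone would use
    // the platform default, just like the parameterless getBytes().
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String pound = "\u00A3"; // pound sign as a Unicode escape
        byte[] bytes = encode(pound);
        System.out.println(bytes.length);                 // 2
        System.out.println(decode(bytes).equals(pound));  // true
    }
}
```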
