equals() is just a method. The implementation of this method is code which can do whatever it wants. equals is not some magic voodoo baked into the JVM itself that returns true if every field (as you pasted them from your debugger tool, kudos for using that by the way!) is identical.
In particular, that means that 2 strings whose value arrays are not identical can still be equal. coder is a flag that records how the text in that byte array is encoded. coder=0 means Latin-1 (one byte per character), coder=1 means UTF-16 (two bytes per character). The 0s in that value (after 67, 72, etcetera) are not the problem. If the -1 and -2 hadn't been there, you'd have been fine here. If you had pasted:
value = [67, 0, 72, 0, 78, 0, 83, 0]
coder = 1
and another string with:
value = [67, 72, 78, 83]
coder = 0
Then they would have been equal to each other!
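A rough way to see the same idea from the outside, without poking at String internals, is to decode those two byte arrays with matching charsets and compare the results. This is only a sketch: the JVM decides the internal layout itself, and for text this simple it would normally use the one-byte form in both cases. The class name SameTextDifferentBytes is made up for the example.

```java
import java.nio.charset.StandardCharsets;

class SameTextDifferentBytes {
    // Two bytes per char, little endian, like the coder = 1 example.
    static String fromUtf16() {
        byte[] wide = {67, 0, 72, 0, 78, 0, 83, 0};
        return new String(wide, StandardCharsets.UTF_16LE);
    }

    // One byte per char, like the coder = 0 example.
    static String fromLatin1() {
        byte[] narrow = {67, 72, 78, 83};
        return new String(narrow, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(fromUtf16().equals(fromLatin1())); // true
        System.out.println(fromUtf16().equals("CHNS"));       // true
    }
}
```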
In general this 'let's look at bytes' thing is not a great way to debug strings like this.
So what IS the problem?
There is a BOM in there. That's what that -1, -2 is. Your debugger shows the byte values as signed decimal numbers. -1 is 0xFF, -2 is 0xFE. The bytes 0xFF 0xFE are the little endian encoding of 0xFEFF, the unicode 'byte order mark' (BOM). That means there's a BOM character in your input file, which is part of the string. It's invisible when you print the character (hence, in your debugger the strings LOOK the same), but it 'counts' for .equals, so java does not think that "CHNS" and "[invisible byte order mark]CHNS" are identical.
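If the negative numbers look odd: Java's byte type is signed, so any value from 0x80 upward is displayed as a negative number. A small sketch (class and method names invented for illustration) that turns what the debugger shows back into hex:

```java
class SignedBytes {
    // Turn a signed byte, as a debugger shows it, into two hex digits.
    static String toHex(byte b) {
        // & 0xFF reinterprets the byte as an unsigned value 0..255
        return String.format("%02X", b & 0xFF);
    }

    public static void main(String[] args) {
        System.out.println(toHex((byte) -1)); // FF
        System.out.println(toHex((byte) -2)); // FE
        System.out.println(toHex((byte) 67)); // 43, which is 'C'
    }
}
```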
To fix it, strip that BOM off; that seems like the easiest solution.
Unicode? BOM? What does this stuff mean??
Unicode is a gigantic table of characters. Just like ASCII has some really weird pseudo-characters (such as item 9, which is a tab, or item 127 which is a delete, or item 7, which is 'generate an audible tone' and not a character at all), unicode has those too. A lot of them, in fact.
One of them is the so-called 'Byte Order Mark'. It is not a visible character. It's 0xFEFF, and the trick is that its byte-swapped twin, 0xFFFE, is not a character at all (one of the few defined as: Does not exist, will never exist, this cannot possibly show up in any sequence of unicode values).
The reason it exists is that on certain systems, it's not clear if numbers are sequenced in little endian or big endian style. Because 0xFFFE cannot exist, if you see the bytes 0xFF, 0xFE in a stream that you know is 16-bit unicode sequences, they can only be 0xFEFF with its bytes swapped, so you know it's little endian (the least significant byte comes first). Hence, a 'byte order' mark - it shows you the byte order.
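Java's own charsets show this in action. As a sketch (the class and method names are invented for the example): the UTF_16 charset reads a leading BOM to pick the byte order and then drops it, while UTF_16LE is told the order up front and keeps the BOM as an ordinary char in the result.

```java
import java.nio.charset.StandardCharsets;

class BomDecoding {
    // BOM (FF FE) followed by 'C' in little endian order.
    static final byte[] DATA = {(byte) 0xFF, (byte) 0xFE, 0x43, 0x00};

    // UTF_16 uses the BOM to detect the byte order, then discards it.
    static String decodeDetecting() {
        return new String(DATA, StandardCharsets.UTF_16);
    }

    // UTF_16LE already knows the order, so the BOM survives as a char.
    static String decodeLittleEndian() {
        return new String(DATA, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        System.out.println(decodeDetecting().length());              // 1
        System.out.println(decodeLittleEndian().length());           // 2
        System.out.println(decodeLittleEndian().charAt(0) == 0xFEFF); // true
    }
}
```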
Because of this history, it's used as a sort of ersatz 'identifier of unicode'. Some folks like sticking this character (which prints as nothing, not even a blank area. It is completely invisible!) at the front of text as a sort of identifier: This is unicode formatted.
And that is exactly what happened here.
Thus, your string is "xCHNS", where the x is the byte order mark. You don't see it in your debug rendering because the byte order mark is to be rendered as nothing at all as per unicode spec. So, there's an invisible character in there.
Nevertheless, java says: You have one string consisting of the character C, then H, then N, then S. You have another string consisting of the character "Byte Order Mark", then C, then H, then N, then S. Clearly, not the same string.
You can test this. Just run in.length() on the thing from the file and you'll find a mysterious '5' answer for your CHNS string, which sure looks like it's only 4 characters.
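Here is roughly what that check could look like, along with a way to make the invisible character visible by printing every char as a hex code. This is a sketch: the hard-coded string stands in for whatever you actually read from the file, and the class and method names are made up.

```java
class InvisibleChar {
    // Render every char as U+XXXX so invisible ones show up.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("U+%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String fromFile = "\uFEFFCHNS"; // stand-in for the value read from the file
        System.out.println(fromFile.length());       // 5
        System.out.println(fromFile.equals("CHNS")); // false
        System.out.println(dump(fromFile));          // U+FEFF U+0043 U+0048 U+004E U+0053
    }
}
```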
How do I fix it?
Strip the byte order mark off. This is not particularly difficult:
String code = details[0].trim();
// startsWith is safe even when the string is empty; charAt(0) would throw
if (code.startsWith("\ufeff")) code = code.substring(1);
// carry on as normal
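
If this comes up in more than one place, the same idea can be wrapped in a small helper that also copes with an empty string. A sketch only; BomStripper and stripBom are names made up for the example.

```java
class BomStripper {
    static final String BOM = "\uFEFF";

    // Remove one leading BOM if present; an empty string is returned as-is.
    static String stripBom(String s) {
        return s.startsWith(BOM) ? s.substring(BOM.length()) : s;
    }

    public static void main(String[] args) {
        System.out.println(stripBom("\uFEFFCHNS").equals("CHNS")); // true
        System.out.println(stripBom("CHNS").equals("CHNS"));       // true
        System.out.println(stripBom("").isEmpty());                // true
    }
}
```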