
There are a lot of posts from people using == instead of equals but this isn't one of them.

I'm reading a list of codes from a CSV file making sure they are equal to a string literal.

Example row from CSV:

[screenshot of an example CSV row]

After reading each code, I trim it and call toUpperCase() before placing it in a map.

private final Map<String, Code> codeMap = new HashMap<>();

private void loadFile() {
    BufferedReader reader = null;
    try {
        reader = new BufferedReader(new FileReader("src/main/resources/codes.csv"));
        String line = null;
        while ((line = reader.readLine()) != null) {
            String[] details = line.split(",");
            codeMap.put(details[0].trim().toUpperCase(), new Code(details[0].trim(), details[1].trim(), details[2].trim()));
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        if (reader != null) {
            try {
                reader.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

I also have a method for retrieving a code based on the string passed in:

public Code getCodeByString(String code) {
    return codeMap.get(code.toUpperCase());
}

After the map is populated, I call getCodeByString with "CHNS" and null is returned. I looked in the map and can see the key "CHNS", yet null still comes back. Inspecting both strings in the debugger, I can immediately tell the byte arrays are different.

String literal:

[debugger screenshot of the string literal's internal fields]

Key in map:

[debugger screenshot of the map key's internal fields]

Does anyone know how I can fix this and make the value from file equal the literal?

Chriskt
  • Make sure you set the correct encoding for your Reader before you start parsing. – Ma3x Jan 07 '22 at 16:01
  • Looks like different encoding - [this](https://stackoverflow.com/questions/62917183/what-is-coder-in-string-value) might help. – Andrew S Jan 07 '22 at 16:02
  • 2
    Folks, it's a BOM. Writing an answer now. – rzwitserloot Jan 07 '22 at 16:06
  • This looks like `codes.csv` was encoded using UTF-16. Try with `new BufferedReader(new InputStreamReader(new FileInputStream("src/main/resources/codes.csv"), "UTF-16"));`. – Pshemo Jan 07 '22 at 16:24

2 Answers


equals() is just a method. The implementation of this method is code which can do whatever it wants. equals is not some magic voodoo baked into the JVM itself that returns true if every field (as you pasted them from your debugger tool, kudos for using that by the way!) is identical.

In particular, that means that 2 strings whose values are non-identical can still be equal. coder is a flag that records how the bytes in that array are encoded: coder=0 means Latin-1 is in there (one byte per character), coder=1 means Java's internal UTF-16 variant is in there (two bytes per character). The 0s in that value (after the 67, the 72, etcetera) are not the problem. If the -1 and -2 hadn't been there, you'd have been fine here. If you had pasted:

value = [67, 0, 72, 0, 78, 0, 83, 0]
coder = 1

and another string with:

value = [67, 72, 78, 83]
coder = 0

Then they would have been equal to each other!
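To illustrate that equals compares the character sequence rather than object identity or internal representation, here is a small sketch (the class name is just for the demo):

```java
public class EqualsDemo {
    public static void main(String[] args) {
        // Two strings built differently: a compile-time literal and a
        // string constructed at runtime from a char array.
        String a = "CHNS";
        String b = new String(new char[]{'C', 'H', 'N', 'S'});

        System.out.println(a == b);      // false: different objects
        System.out.println(a.equals(b)); // true: same character sequence
    }
}
```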

In general, this "let's look at the bytes" approach is not a great way to debug strings like this.

So what IS the problem

There is a BOM in there. That's what that -1, -2 is. Your debugger shows the byte values as signed decimal numbers: -1 is 0xFF, -2 is 0xFE. 0xFF 0xFE is the unicode 'byte order mark' (BOM). That means there's a BOM character in your input file, and it became part of the string. It's invisible when you print the string (hence, in your debugger the strings LOOK the same), but it 'counts' for .equals, so java does not think that "CHNS" and "[invisible byte order mark]CHNS" are equal.

To fix it you'd have to strip that BOM off; that seems like the easiest solution.

Unicode? BOM? What does this stuff mean??

Unicode is a gigantic table of characters. Just like ASCII has some really weird pseudo-characters (such as item 9, which is a tab, or item 127 which is a delete, or item 7, which is 'generate an audible tone' and not a character at all), unicode has those too. A lot of them, in fact.

One of them is the so-called 'Byte Order Mark'. It is a non-character. It's 0xFEFF, and the trick is, 0xFFFE is not a character at all (one of the few defined as: Does not exist, will never exist, this cannot possibly show up in any sequence of unicode values).

The reason it exists is that on some systems it's not clear whether 16-bit values are sequenced in little endian or big endian style. Because 0xFFFE is not a valid character, if you see the bytes 0xFF 0xFE at the start of a stream that you know is 16-bit unicode sequences, you know the stream is little endian (the least significant byte comes first). Hence, a 'byte order' mark - it shows you the byte order.
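You can see this mechanism at work with Java's own UTF-16 decoder, which reads a leading BOM, uses it to pick the byte order, and does not emit it as a character (a small sketch; the byte values are made up for the demo):

```java
import java.nio.charset.StandardCharsets;

public class BomDemo {
    public static void main(String[] args) {
        // 0xFF 0xFE is a little-endian BOM; the UTF-16 decoder consumes it,
        // switches to little-endian mode, and does not keep it as a character.
        byte[] littleEndian = {(byte) 0xFF, (byte) 0xFE, 'C', 0, 'H', 0};
        String s = new String(littleEndian, StandardCharsets.UTF_16);
        System.out.println(s);          // CH
        System.out.println(s.length()); // 2 - the BOM was consumed, not kept
    }
}
```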

Because of this history, it's used as a sort of ersatz 'identifier of unicode'. Some folks like sticking this character (which prints as nothing, not even a blank area. It is completely invisible!) at the front of text as a sort of identifier: This is unicode formatted.

And that is exactly what happened here.

Thus, your string is "xCHNS", where the x is the byte order mark. You don't see it in your debug rendering because the byte order mark is to be rendered as nothing at all as per unicode spec. So, there's an invisible character in there.

Nevertheless, java says: You have one string consisting of the character C, then H, then N, then S. You have another string consisting of the character "Byte Order Mark", then C, then H, then N, then S. Clearly, not the same string.

You can test this. Just call .length() on the string you read from the file and you'll get a mysterious 5 for your CHNS string, which sure looks like only 4 characters.
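Simulating the string as it came out of the file makes the effect visible (a sketch; \uFEFF stands in for the BOM that the file reader produced):

```java
public class InvisibleBomDemo {
    public static void main(String[] args) {
        // The string as read from the file: an invisible BOM, then "CHNS".
        String fromFile = "\uFEFF" + "CHNS";
        String literal = "CHNS";

        System.out.println(fromFile.length());        // 5, not 4
        System.out.println(fromFile.equals(literal)); // false
    }
}
```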

How do I fix it?

Strip the byte order mark off. This is not particularly difficult:

String code = details[0].trim();
if (!code.isEmpty() && code.charAt(0) == '\ufeff') code = code.substring(1);
// carry on as normal
rzwitserloot
  • Since the edit queue is full and can't edit it now, in the fix you need equality check (double equal to operator) and not assignment :) – Toni Nagy Jan 07 '22 at 16:24
  • 1
    Just wondering if Readers set to UTF-16 wouldn't handle BOM for us since it is *metainformation* which probably shouldn't be part of retrieved text. If yes then IMO this would be preferred solution instead of having to check and remove BOM ourselves. – Pshemo Jan 07 '22 at 16:30
  • 1
    @ToniNagy good eye! I edited the snippet. Thanks! – rzwitserloot Jan 07 '22 at 16:46

It seems you are reading a UTF-8 file with a redundant BOM character \uFEFF (the -1, -2 you see in the debugger).

So you should discard the BOM, and actually read the file as UTF-8 (so any special characters survive). FileReader reads the file in the platform's default encoding, so use another way of reading it.

Also, you are reading from a resource file. It will be packed into the application (jar?), so you should not read it from a disk path.

Files.lines uses UTF-8 by default.

private void loadFile() {
    try {
        Path path = Paths.get(getClass().getResource("/codes.csv").toURI());
        try (Stream<String> lines = Files.lines(path)) {
            lines.forEach(line -> {
                String[] details = line.split("\\s*,\\s*", 3);
                String key = details[0].replace("\uFEFF", "");
                // Replacing the BOM would only be needed on the first line.
                codeMap.put(key.toUpperCase(), new Code(key, details[1], details[2]));
            });
        } // Automatic close of lines.
    } catch (IOException | URISyntaxException e) {
        e.printStackTrace();
    }
}

The regex in split strips whitespace before and after the comma. The split is also limited to 3 values, so the last field may contain commas as text.
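The split behavior can be sketched like this (the sample line is hypothetical, made up for the demo):

```java
public class SplitDemo {
    public static void main(String[] args) {
        // Hypothetical CSV line; the third field itself contains a comma.
        String line = "chns , Chainsaw , Heavy duty, gas powered";

        // Strip whitespace around commas, and stop after 3 fields.
        String[] details = line.split("\\s*,\\s*", 3);

        System.out.println(details[0]); // chns
        System.out.println(details[1]); // Chainsaw
        System.out.println(details[2]); // Heavy duty, gas powered
    }
}
```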

Joop Eggen