Comparing Strings with different byte order marks in Java

Question

In my Java program I have two strings, s1 and s2. When printed they look identical, but s1.equals(s2) returns false because they are encoded differently. How can I compare these two strings so that they are treated as equal even though they are encoded differently?

Look at this example code:

    s1 = s1.trim();
    s2 = s2.trim();
    byte[] s1bytes = s1.getBytes();
    byte[] s2bytes = s2.getBytes();
    System.out.println(s1+","+s2+","+s1.equals(s2));

    System.out.println("\ns1's bytes are:");
    for (int i = 0; i < s1bytes.length; i++) {
        System.out.println(s1bytes[i]);
    }

    System.out.println("\ns2's bytes are:");
    for (int i = 0; i < s2bytes.length; i++) {
        System.out.println(s2bytes[i]);
    }

This prints:

SHEOGMIOF,SHEOGMIOF,false

s1's bytes are:
-17
-69
-65
83
72
69
79
71
77
73
79
70

s2's bytes are:
83
72
69
79
71
77
73
79
70

As you can see, s1 and s2 look the same when printed, but they do not compare equal and their byte arrays differ.

EDIT: My question is different from this question because I am not reading data from a file; it is the source code in the .java file that is encoded differently, not data from another file.

What do you mean "encoded differently"? Java strings are always encoded in UCS-2. — Silvio Mayolo, Jul 14 '18 at 00:44
encoding refers to their representation when they are written to disk or any other kind of "byte array". My guess is that your strings have some white space or other non visible characters which make them look the same on screen but are actually different. Try printing the value of each character as an integer and see where the difference lies. — slipperyseal, Jul 14 '18 at 00:48
Russian language for example also has letters a, o, c and some others, maybe some of your letters are in a different language. — Coder-Man, Jul 14 '18 at 00:56
@Silvio Mayolo I have edited my post to include an example. Is this not an encoding problem? If it isn't, let me know and I'll change the title. — Tacodiva, Jul 14 '18 at 01:03
It is probably a *decoding* problem, in the steps that decode from your external representations to `Strings`. The strings themselves are not encoded differently. — user207421, Jul 14 '18 at 01:06
see “EF BB BF” at the beginning of JSON files created in Visual Studio https://stackoverflow.com/questions/44098326/ef-bb-bf-at-the-beginning-of-json-files-created-in-visual-studio — slipperyseal, Jul 14 '18 at 01:15
Possible duplicate of [Unknown characters](https://stackoverflow.com/questions/7494570/unknown-characters) — flakes, Jul 14 '18 at 01:16
Read this - https://stackoverflow.com/questions/1835430/byte-order-mark-screws-up-file-reading-in-java — Amit K. Saha, Jul 14 '18 at 01:27
@Amit K. Saha I just edited my post, I am not reading data in from a file, it's the actual .java file that's encoded this way. — Tacodiva, Jul 14 '18 at 01:30

score 3 · Accepted Answer · answered Jul 14 '18 at 01:23

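Remove the byte order mark (BOM) from the strings as you read them from a file. The character code for this is "\uFEFF". The three leading bytes of s1 (-17, -69, -65, i.e. EF BB BF) are the UTF-8 encoding of that character.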

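import java.nio.charset.StandardCharsets;

public class Foo {
    public static void main(final String[] args) {
        // b1 starts with EF BB BF (-17, -69, -65), the UTF-8 encoding of the BOM.
        final byte[] b1 = {-17, -69, -65, 83, 72, 69, 79, 71, 77, 73, 79, 70};
        final byte[] b2 = {83, 72, 69, 79, 71, 77, 73, 79, 70};

        // Decode explicitly as UTF-8 so the result does not depend on the
        // platform default charset, then remove the BOM character U+FEFF.
        final String s1 = new String(b1, StandardCharsets.UTF_8).replace("\uFEFF", "");
        final String s2 = new String(b2, StandardCharsets.UTF_8).replace("\uFEFF", "");

        System.out.println(s1);
        System.out.println(s2);
        System.out.println(s1.equals(s2));
    }
}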

prints:

SHEOGMIOF
SHEOGMIOF
true
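
Note that `replace("\uFEFF", "")` removes every U+FEFF in the string, not only a leading one, and that `trim()` (as used in the question) does not help because it only strips characters up to U+0020. If only a leading BOM should be dropped, something along these lines may be closer to what is wanted. This is a minimal sketch; `BomUtil` and `stripLeadingBom` are hypothetical names, not part of the JDK.

```java
public class BomUtil {
    // U+FEFF is the byte order mark; in UTF-8 it is encoded as EF BB BF.
    private static final char BOM = '\uFEFF';

    // Hypothetical helper: drops a single leading BOM and leaves the rest untouched.
    public static String stripLeadingBom(final String s) {
        if (s != null && !s.isEmpty() && s.charAt(0) == BOM) {
            return s.substring(1);
        }
        return s;
    }

    public static void main(final String[] args) {
        final String s1 = "\uFEFFSHEOGMIOF"; // same text as in the question, with a leading BOM
        final String s2 = "SHEOGMIOF";

        System.out.println(s1.equals(s2));                                   // false
        System.out.println(stripLeadingBom(s1).equals(stripLeadingBom(s2))); // true
    }
}
```

Stripping only at position 0 keeps any U+FEFF that appears later in the text, where it may have been intended as a zero-width no-break space.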

collapsar · Answer 2 · 2018-07-14T01:42:36.477

The samples from the question didn't actually differ in their encodings but in the presence/absence of the byte order mark.
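
One way to see this is to print the code units of each string rather than its bytes, as suggested in the comments. The sketch below is only illustrative; `CodePointDump` and `dump` are hypothetical names, and the sample strings are shortened stand-ins for the ones in the question.

```java
public class CodePointDump {
    // Hypothetical helper: renders each UTF-16 code unit as U+XXXX
    // so that invisible characters such as the BOM become visible.
    public static String dump(final String s) {
        final StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(final String[] args) {
        System.out.println(dump("\uFEFFSHE")); // U+FEFF U+0053 U+0048 U+0045
        System.out.println(dump("SHE"));       // U+0053 U+0048 U+0045
    }
}
```

Both strings hold the same letters; the first simply carries an extra U+FEFF character at the front.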

The following class demonstrates how to handle the case when the byte sequences do indeed represent different string encodings. In the example code, the encodings must be known. Note that in general it is a non-trivial task to deduce the encoding from the byte sequence alone.

//  https://stackoverflow.com/questions/229015/encoding-conversion-in-java
//

import java.io.UnsupportedEncodingException;

public class encotest {
    public static void main(String[] args) {
        // German lowercase umlauted vowels (äöü) as octet sequences in 2 different encodings
        byte[]  raw_iso8859_15  = { (byte) 0xE4, (byte) 0xF6, (byte) 0xFC };
        byte[]  raw_utf8        = { (byte) 0xC3, (byte) 0xA4, (byte) 0xC3, (byte) 0xB6, (byte) 0xC3, (byte) 0xBC };

        try {
            String s_umlauts_from_iso   = new String(raw_iso8859_15 , "ISO-8859-15");
            String s_umlauts_from_utf8  = new String(raw_utf8       , "UTF-8");

            if (s_umlauts_from_iso.equals(s_umlauts_from_utf8)) {
                System.out.println("They are the same !");
            }
            else {
                System.out.println("They differ!");
            }
        } catch (UnsupportedEncodingException uee) {
            System.out.println("Error: cannot convert");
        }
    }
}

Expected output:

They are the same !
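
When the encoding is not known, a leading BOM is one of the few cheap hints available. The following is a rough sketch of that idea only, not a general encoding detector: `BomSniffer` and `charsetFromBom` are hypothetical names, only three BOMs are checked, and data without a BOM yields no answer at all. It also does not distinguish UTF-32LE, whose BOM begins with the same FF FE bytes as UTF-16LE.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomSniffer {
    // Hypothetical helper: returns the charset implied by a leading BOM,
    // or null when the bytes do not start with one of the checked BOMs.
    public static Charset charsetFromBom(final byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        return null;
    }

    public static void main(final String[] args) {
        // Byte values taken from the question's output.
        final byte[] b1 = {-17, -69, -65, 83, 72, 69, 79, 71, 77, 73, 79, 70};
        final byte[] b2 = {83, 72, 69, 79, 71, 77, 73, 79, 70};

        System.out.println(charsetFromBom(b1)); // UTF-8
        System.out.println(charsetFromBom(b2)); // null
    }
}
```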

@flakes You're right, added expected program output to the answer. Thank you. — collapsar, Jul 14 '18 at 01:43
