Java - read UTF-8 file with a single emoji symbol

Question

I have a file with a single unicode symbol.
The file is encoded in UTF-8.
It contains a single symbol represented as 4 bytes.
https://www.fileformat.info/info/unicode/char/1f60a/index.htm

F0 9F 98 8A

When I read the file I get two symbols/chars.

The program below prints

?
2
?
?
55357
56842
======================================
&#55357;&#56842;
16
&
======================================
?
2
?
======================================

Is this normal... or a bug? Or am I misusing something?
How do I get that single emoji symbol in my code?

EDIT: And also... how do I escape it for XML?

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

public class Test008 {

    public static void main(String[] args) throws Exception{
        BufferedReader in = new BufferedReader(
                   new InputStreamReader(
                              new FileInputStream("D:\\DATA\\test1.txt"), "UTF8"));
        
        String s = "";
        while ((s = in.readLine()) != null) {
            System.out.println(s);
            System.out.println(s.length());
            System.out.println(s.charAt(0));
            System.out.println(s.charAt(1));
            
            System.out.println((int)(s.charAt(0)));
            System.out.println((int)(s.charAt(1)));
            
            String z = org.apache.commons.lang.StringEscapeUtils.escapeXml(s);
            String z3 = org.apache.commons.lang3.StringEscapeUtils.escapeXml(s);
            
            System.out.println("======================================");
            System.out.println(z);
            System.out.println(z.length());
            System.out.println(z.charAt(0));
            
            System.out.println("======================================");
            System.out.println(z3);
            System.out.println(z3.length());
            System.out.println(z3.charAt(0));
            
            System.out.println("======================================");

        }

        in.close();
    }

}

Shouldn't the charset be called `"UTF-8"` instead of `"UTF8"`? — f1sh, Jul 28 '20 at 12:05
@f1sh I think both are OK but will try it... Yeah... same thing. — peter.petrov, Jul 28 '20 at 12:08
Note that you *need not* escape those characters at all in XML, you can just write them as-is, provided you use the correct encoding and the receiving side handles XML correctly. The only characters you *must* escape are the ones that the syntax of XML itself uses (and even those not always, for example `<` doesn't need to be escaped in attribute values, but `&` must be escaped). — Joachim Sauer, Jul 28 '20 at 13:46
@JoachimSauer Thanks... Yes, seems that's what `StringEscapeUtils.escapeXml10` from Apache commons lang 3.11 does. It simply does not escape it. I got it working now, I think. Thanks a lot! — peter.petrov, Jul 28 '20 at 14:44

Joop Eggen · Answer 1 · 2020-07-28T12:47:26.933

4

Yes, this is normal: the Unicode symbol is stored as 2 UTF-16 chars (1 char is 2 bytes).

int codePoint = s.codePointAt(0); // Your code point.
System.out.printf("U+%04X, chars: %d%n", codePoint, Character.charCount(codePoint));

U+1F60A, chars: 2
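
As a minimal sketch of the same idea, the four bytes from the question can be decoded directly instead of being read from a file. The class name `CodePointDemo` and the helper `describe` are made up for illustration; only standard JDK calls are used.

```java
import java.nio.charset.StandardCharsets;

final class CodePointDemo {

    /** Reports the first code point of s plus the string's size in chars and in code points. */
    static String describe(String s) {
        int codePoint = s.codePointAt(0);
        return String.format("U+%04X, chars: %d, code points: %d",
                codePoint, s.length(), s.codePointCount(0, s.length()));
    }

    public static void main(String[] args) {
        // The UTF-8 bytes F0 9F 98 8A from the question.
        byte[] utf8 = {(byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0x8A};
        String s = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(describe(s)); // U+1F60A, chars: 2, code points: 1
    }
}
```

`length()` counts UTF-16 chars, while `codePointCount` counts Unicode code points, which is why the two numbers differ for this string.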

After comments

Java, using a Stream:

public static String escapeToAsciiHTML(String s) {
    StringBuilder sb = new StringBuilder();
    s.codePoints().forEach(cp -> {
        if (cp < 128) {
            sb.append((char) cp);
        } else{
            sb.append("&#").append(cp).append(";");
        }
    });
    return sb.toString();
}
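
The method above leaves the XML markup characters `<`, `>`, `&`, `"` and `'` untouched. A possible variant that also escapes those is sketched below. The names `XmlAsciiEscaper` and `escapeToAsciiXml` are hypothetical, and the sketch assumes the input contains only well-formed surrogate pairs and no control characters that XML 1.0 forbids.

```java
final class XmlAsciiEscaper {

    /** Escapes XML markup characters and replaces every non-ASCII code point with a numeric entity. */
    static String escapeToAsciiXml(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> {
            switch (cp) {
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '&':  sb.append("&amp;");  break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:
                    if (cp < 128) {
                        sb.append((char) cp);                    // plain ASCII, copy as-is
                    } else {
                        sb.append("&#").append(cp).append(';');  // one entity per code point
                    }
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // "\uD83D\uDE0A" is the surrogate pair for U+1F60A.
        System.out.println(escapeToAsciiXml("a<b&\uD83D\uDE0A")); // a&lt;b&amp;&#128522;
    }
}
```

Because the loop runs over code points rather than chars, the emoji becomes the single entity `&#128522;` instead of two surrogate entities.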

edited Jul 28 '20 at 12:47

answered Jul 28 '20 at 12:12

Joop Eggen

..and the character is unprintable on your terminal, so you get `?` as output. – Kayaman Jul 28 '20 at 12:12
OK thanks... then when I escape this string `s` for XML using `System.out.println(org.apache.commons.lang.StringEscapeUtils.escapeXml(s));` I get this: `&#55357;&#56842;`. This is wrong, right? Should I conclude this is a bug in StringEscapeUtils? – peter.petrov Jul 28 '20 at 12:13
@peter.petrov that's `StringEscapeUtils` not doing a proper job. – Kayaman Jul 28 '20 at 12:14
@Kayaman gives a good hint. The console probably cannot show the emoji anyway, and taken separately, each of the two chars is invalid on its own. Such a pair of chars is called a _surrogate pair_. – Joop Eggen Jul 28 '20 at 12:14
That's fine, I don't care about the console output that much. I want to properly escape this string `s` for XML. Seems this library StringEscapeUtils is messing it up, isn't it? How does one do that in Java then? – peter.petrov Jul 28 '20 at 12:15
See also my previous question: https://stackoverflow.com/questions/63131658/are-these-characters-valid-for-xml The current question is a follow-up to that one. – peter.petrov Jul 28 '20 at 12:17
So... my current guess is that library StringEscapeUtils is messing it up. – peter.petrov Jul 28 '20 at 12:18
@peter.petrov If you do want to print emoji's to console output, see e.g. [Windows 10 CLI UTF-8 encoding](https://stackoverflow.com/a/49016444/5221149). – Andreas Jul 28 '20 at 12:21
I don't care about console output. I want to properly escape to XML. I updated my code in the question. Seems like a triumph of buggy libs. – peter.petrov Jul 28 '20 at 12:24
You could write your own surrogate pair find-and-replace, replacing the code point with `...;`. Or you write the String to HTML; if that page is in UTF-8, that will do. – Joop Eggen Jul 28 '20 at 12:25
No, I escape to XML and pass down the XML to a Postgres function. HTML page doesn't help here. – peter.petrov Jul 28 '20 at 12:26
You could write your own surrogate pair find-and-replace, replacing the code point with `...;` => That's the thing... I have no idea how to do this, what pairs to look for etc. I am not that much into Unicode. Can you point me to some sample code, or references, or anything that could help me do this in some decent way? – peter.petrov Jul 28 '20 at 12:27
I tried org.apache.commons.lang and org.apache.commons.lang3; both seem buggy, no? – peter.petrov Jul 28 '20 at 12:28
For XML in UTF-8 you can do for instance an OutputStreamWriter with UTF-8. I have added a conversion function too. – Joop Eggen Jul 28 '20 at 12:49
When the final `XML`/`Html` file is encoded in UTF-8, there is no need to escape these codepoints at all. But what you need to care about, is to escape, `<`, `>`, `&`, `"` and `'`, which this `escapeToAsciiHTML` method does not handle. On the other hand, if the resulting string is supposed to be passed to another XML writing method doing the correct encoding anyway, there’s even less reason to escape codepoints. – Holger Jul 28 '20 at 13:45
Also, such numeric entities exist in XML, so via an XML object one might get `&#1111;`, or as a String the emoji is inserted again as characters on the receiver side. As @Holger said, the best option is to do nothing. – Joop Eggen Jul 28 '20 at 14:04

score 3 · Answer 2 · answered Jul 28 '20 at 12:28

StringEscapeUtils is broken. Don't use it. Try NumericEntityEscaper.

Or, better yet, since Apache Commons libraries tend to have bad APIs** and be broken*** anyway, use Guava*'s XmlEscapers.

Java is Unicode, yes, but 'char' is a lie. 'char' does not represent characters; it represents a single, unsigned 16-bit number. The actual method to get a character out of, say, a j.l.String object isn't charAt, which is a misnomer; it's codePointAt, and friends.

This (char being a fakeout) normally doesn't matter; most actual characters fit in the 16-bit char type. But when they don't, this matters, and that emoji doesn't fit. In the unicode model used by java and the char type, you then get 2 char values (representing a single unicode character). This pair is called a 'surrogate pair'.
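
To make the pair concrete, here is a sketch of the UTF-16 arithmetic for the emoji in the question, U+1F60A. The class and method names are invented for illustration; in real code `Character.toChars` does the same job.

```java
final class SurrogateDemo {

    /** Splits a supplementary code point (above U+FFFF) into its high and low surrogate. */
    static char[] toSurrogatePair(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("fits in one char, no pair needed");
        }
        int offset = codePoint - 0x10000;               // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low  = (char) (0xDC00 + (offset & 0x3FF)); // bottom 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        char[] pair = toSurrogatePair(0x1F60A);
        System.out.println((int) pair[0]); // 55357
        System.out.println((int) pair[1]); // 56842
        // The JDK can put the pair back together again.
        System.out.println(Character.toCodePoint(pair[0], pair[1]) == 0x1F60A); // true
    }
}
```

55357 and 56842 are exactly the two numbers the program in the question printed for `s.charAt(0)` and `s.charAt(1)`.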

Note that the right methods tend to work in int (you need the 32 bits to represent one single unicode symbol, after all).
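
For example, a small sketch of walking a string by code point instead of by char might look like this (the class `SymbolSplitter` and its method `symbols` are hypothetical names, not part of any library):

```java
import java.util.List;
import java.util.stream.Collectors;

final class SymbolSplitter {

    /** Splits s into one String per Unicode code point (not per char). */
    static List<String> symbols(String s) {
        return s.codePoints()                                       // IntStream of code points
                .mapToObj(cp -> new String(Character.toChars(cp)))  // back to a 1- or 2-char String
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE0Ab";       // 'a', U+1F60A, 'b'
        List<String> parts = symbols(s);
        System.out.println(s.length());    // 4
        System.out.println(parts.size());  // 3
    }
}
```

Note that a code point is still not always what a user perceives as one character; combining marks and emoji sequences can span several code points, which this sketch does not try to handle.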

*) guava has its own issues, by being aggressively not backwards compatible with itself, it tends to lead to dependency hell. It's a pick your poison kind of deal, unfortunately.

**) Utils-anything is usually a sign of bad API design; 'util' is almost meaningless as a term and usually implies you've broken the object oriented model. The right model is of course to have an object representing the process of translating data in one form (say, a raw string) to another (say, a string that can be dumped straight into an XML file, escaped and well) - and such a thing would thus be called an 'escaper', and would live perhaps in a package named 'escapers' or 'text'. Later editions of apache libraries, as well as guava, fortunately 'fixed' this.

***) As this very example shows, these APIs often don't do what you want them to. Note that apache is open source; if you want these APIs to be better, they accept pull requests :)

Thanks very much. Seems we include guava already so I guess I will try with it. The xmlAttributeEscaper() method - can I reuse this instance which I get from there? Or do I need to create a new one each time? This whole thing I am working on lives in a pretty complicated multi-threaded backend app. — peter.petrov, Jul 28 '20 at 12:35
I think this `XmlEscapers.xmlAttributeEscaper()` does not work either. — peter.petrov, Jul 28 '20 at 12:38
Or maybe it does... I am so confused, I've been working on this issue the whole day :) Seems `XmlEscapers.xmlAttributeEscaper().escape(s)` produces the same string as `org.apache.commons.lang3.StringEscapeUtils.escapeXml(s)`, I just compared the two strings produced and they are equal (as per equals method, I mean). — peter.petrov, Jul 28 '20 at 12:40
@peter.petrov from the linked documentation of `XmlEscapers`: “*Currently the escapers provided by this class do not escape any characters outside the ASCII character range.*” — Holger, Jul 28 '20 at 14:22
@Holger, Yeah, I noticed it later. Thanks. `org.apache.commons.lang3.StringEscapeUtils.escapeXml10` seems to work best for me! From the 3.11 jar version: https://mvnrepository.com/artifact/org.apache.commons/commons-lang3/3.11 — peter.petrov, Jul 28 '20 at 14:45
