Unable to decode Java UTF8

Question

The input to my Java program is the character codes for the Japanese string 年ネン, and I need to convert them back to Japanese. I tried getBytes(UTF8), but could not get the Japanese text back. Can you please help?

My input is &#24180; &#12493;&#12531;, which corresponds to the Japanese characters "年 ネン".

There are many steps involved before you can see the expected results, and each of them can play a role in your problem. Please show us a small, complete code example, something we can compile and run on our computers. Also, where do you run it: in an IDE, in a Windows CMD window, or somewhere else? This is also quite important. — Ralf Kleberhoff, Jun 20 '23 at 06:56
Does this answer your question? [How can I unescape HTML character entities in Java?](https://stackoverflow.com/questions/994331/how-can-i-unescape-html-character-entities-in-java) — JosefZ, Jun 20 '23 at 15:13
@Josefz Your proposed duplicate is for character entities (as explicitly stated in its title), but this question is about unescaping numeric entities. — skomisa, Jun 20 '23 at 22:48

score 1 · Answer 1 · answered Jun 20 '23 at 22:54

As @skomisa has said, it's probably better to use a proper library for this, but if you want something quick-and-dirty, then:

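import java.util.Arrays;

public class DecimalEntityDecoder {
    public static void main(String[] args) {
        try {
            // Expects one argument such as "&#24180; &#12493;&#12531;" (quote it in the shell).
            String decimalEntities = args[0];
            // Drop whitespace, split on ';', keep only the digits of each entity,
            // then append each number to the result as a Unicode code point.
            String output = Arrays.stream(decimalEntities.replaceAll("\\s+", "").split(";"))
                .map(s -> s.replaceAll("\\D", ""))
                .filter(s -> !s.isEmpty()) // skip tokens that contain no digits
                .mapToInt(Integer::parseInt)
                .collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append)
                .toString();
            System.out.println(output);
        }
        catch(Throwable t) {
            t.printStackTrace();
        }
    }
}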

That works fine, and I'm now thinking your quick-and-dirty approach is probably better than using a library. The sample data in the OP was clean, but it's nice to be in complete control of the process if there are issues with the input. — skomisa, Jun 21 '23 at 06:27

score 0 · Answer 2 · answered Jun 20 '23 at 22:42

I don't know of any JDK method to resolve your issue, but you can do it in one line using Apache's StringEscapeUtils.unescapeHtml4() method, and the code is trivial. Here's an example using your sample data:

import org.apache.commons.text.StringEscapeUtils;

public class Main {

    public static void main(String[] args) throws Exception {
        String input = "&#24180; &#12493;&#12531;";
        System.out.println("input=[" + input + "]");
        String output = StringEscapeUtils.unescapeHtml4(input);
        System.out.println("output=[" + output + "]");
    }
}

Here's the output:

input=[&#24180; &#12493;&#12531;]
output=[年 ネン]

I ran the code using JDK 20 in IntelliJ IDEA.

An alternative pure Java approach would be to write the code yourself (but why bother?):

Use a StringTokenizer to extract each numeric code point from the input.
Convert each extracted code point to a character using Character.toString(int codePoint).
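
A minimal sketch of those two steps, assuming the input contains only decimal entities of the form &#NNNN; separated by optional whitespace. The class and method names (TokenizerEntityDecoder, decode) are made up for illustration, and Character.toString(int) requires Java 11 or later:

```java
import java.util.StringTokenizer;

public class TokenizerEntityDecoder {

    // Hypothetical helper: decodes decimal entities such as "&#24180;".
    // Assumes every token left after splitting is a valid decimal code point.
    static String decode(String input) {
        StringBuilder out = new StringBuilder();
        // Treat '&', '#', ';' and whitespace as delimiters so only the digits remain as tokens.
        StringTokenizer tokens = new StringTokenizer(input, "&#; \t\r\n");
        while (tokens.hasMoreTokens()) {
            int codePoint = Integer.parseInt(tokens.nextToken());
            // Character.toString(int) also handles supplementary code points (Java 11+).
            out.append(Character.toString(codePoint));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("&#24180; &#12493;&#12531;"));
    }
}
```

Because whitespace is treated as a delimiter here, the space in the sample input is dropped, so this yields 年ネン rather than 年 ネン. Any non-numeric text in the input would make Integer.parseInt throw a NumberFormatException.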

Notes:

The javadoc for unescapeHtml4() does not explicitly state that it supports numeric entities in the format you specify them, but it seems to work fine.
The StringEscapeUtils class can be found in both Apache Commons Text (org.apache.commons.text) and Apache Commons Lang (org.apache.commons.lang3, where it is deprecated in favor of Commons Text). Be sure to use the org.apache.commons.text implementation.
You will need to download the Apache Commons Text library and add it to your project, and import org.apache.commons.text.StringEscapeUtils; in your code.
You will also need to download the Apache Commons Lang library and add it to your project. Otherwise you will get a java.lang.ClassNotFoundException: org.apache.commons.lang3.Range at runtime.
Any spaces in the input will be preserved by unescapeHtml4() in the output.
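
If preserving spaces matters but adding a library is not an option, a regex-based, JDK-only sketch might look like the following. This is not how unescapeHtml4() is implemented; it only handles decimal entities of the form &#NNNN; and leaves all other text untouched. The names RegexEntityDecoder and decode are hypothetical, and the code assumes Java 11 or later (Matcher.replaceAll with a function needs Java 9, Character.toString(int) needs Java 11):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexEntityDecoder {

    // Matches decimal numeric entities only, e.g. "&#24180;".
    private static final Pattern DECIMAL_ENTITY = Pattern.compile("&#(\\d+);");

    // Hypothetical helper: replaces each decimal entity with its character
    // and copies everything else (including spaces) through unchanged.
    static String decode(String input) {
        return DECIMAL_ENTITY.matcher(input).replaceAll(match -> {
            int codePoint = Integer.parseInt(match.group(1));
            // quoteReplacement guards against decoded '$' or '\' being
            // interpreted as replacement-string syntax.
            return Matcher.quoteReplacement(Character.toString(codePoint));
        });
    }

    public static void main(String[] args) {
        System.out.println("output=[" + decode("&#24180; &#12493;&#12531;") + "]");
    }
}
```

Note that a number outside the valid Unicode range makes Character.toString(int) throw an IllegalArgumentException, so untrusted input would need extra validation.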
