-1

The input to my Java program is the character codes for the Japanese string 年 ネン, and I need to convert them back to Japanese. I tried getBytes(UTF8) but couldn't get it to work. Can you please help?

My input is &#24180; &#12493;&#12531;, which corresponds to the Japanese characters "年 ネン".

skomisa
user273121
  • My input is &#24180; &#12493;&#12531; – user273121 Jun 20 '23 at 05:16
    There are many steps involved before you can see the expected results, and each of them can play a role in your problem. Please show us a small, complete code example, something we can compile and run on our computers. Also, do you run in an IDE, in a Windows CMD window, or somewhere else? This is also quite important. – Ralf Kleberhoff Jun 20 '23 at 06:56
  • Does this answer your question? [How can I unescape HTML character entities in Java?](https://stackoverflow.com/questions/994331/how-can-i-unescape-html-character-entities-in-java) – JosefZ Jun 20 '23 at 15:13
  • You've tried what, so far? – g00se Jun 20 '23 at 16:55
  • @Josefz Your proposed duplicate is for character entities (as explicitly stated in its title), but this question is about unescaping numeric entities. – skomisa Jun 20 '23 at 22:48

2 Answers

1

As @skomisa has said, it's probably better to use a proper library for this, but if you want something quick-and-dirty, then:

import java.util.Arrays;

public class DecimalEntityDecoder {
    public static void main(String[] args) {
        try {
            String decimalEntities = args[0];
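            // Strip whitespace, split on ';', then keep only the digits of each "&#NNNNN" chunk.
            // Note: because whitespace is stripped, spaces between entities are not preserved.
            String output = Arrays.stream(decimalEntities.replaceAll("\\s+", "").split(";"))
                .map(s -> s.replaceAll("\\D", ""))
                .filter(s -> !s.isEmpty())
                .mapToInt(Integer::parseInt)
                .collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append)
                .toString();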
            System.out.println(output);
        }
        catch(Throwable t) {
            t.printStackTrace();
        }
    }
}
g00se
  • That works fine, and I'm now thinking your quick-and-dirty approach is probably better than using a library. The sample data in the OP was clean, but it's nice to be in complete control of the process if there are issues with the input. – skomisa Jun 21 '23 at 06:27
0

I don't know of any JDK method that resolves your issue, but you can do it in one line using the StringEscapeUtils.unescapeHtml4() method from Apache Commons Text, and the code is trivial. Here's an example using your sample data:

import org.apache.commons.text.StringEscapeUtils;

public class Main {

    public static void main(String[] args) throws Exception {
        String input = "&#24180; &#12493;&#12531;";
        System.out.println("input=[" + input + "]");
        String output = StringEscapeUtils.unescapeHtml4(input);
        System.out.println("output=[" + output + "]");
    }
}

Here's the output:

input=[&#24180; &#12493;&#12531;]
output=[年 ネン]

I ran the code using JDK 20 in IntelliJ IDEA.

An alternative pure Java approach would be to write the code yourself (but why bother?).
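
A minimal sketch of what such a hand-written decoder might look like, using only java.util.regex. The class name NumericEntityDecoder and its decode method are hypothetical names chosen for illustration; they are not part of the JDK or of Apache Commons. The sketch assumes the input contains decimal (&#24180;) or hexadecimal (&#x5E74;) numeric references and leaves everything else, including spaces and named entities, untouched:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not a JDK or Apache Commons class.
final class NumericEntityDecoder {

    // Group 1 captures hex digits (&#x5E74;), group 2 captures decimal digits (&#24180;).
    private static final Pattern NUMERIC_ENTITY =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    static String decode(String input) {
        Matcher m = NUMERIC_ENTITY.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous match and this one unchanged.
            out.append(input, last, m.start());
            int codePoint = (m.group(1) != null)
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            if (Character.isValidCodePoint(codePoint)) {
                out.appendCodePoint(codePoint);
            } else {
                // Out-of-range reference: keep the original text as-is.
                out.append(m.group());
            }
            last = m.end();
        }
        // Copy whatever follows the last match.
        out.append(input, last, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample data from the question; the space between entities is preserved.
        System.out.println(decode("&#24180; &#12493;&#12531;"));
    }
}
```

Unlike splitting on ";" and stripping non-digits, this keeps any surrounding text intact, at the cost of a little more code. Whether the decoded characters display correctly still depends on the console or IDE encoding.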

Notes:

  • The javadoc for unescapeHtml4() does not explicitly state that it supports numeric entities in the format you specify them, but it seems to work fine.
  • The StringEscapeUtils class can be found in both org.apache.commons.text and org.apache.commons.lang. Be sure to use the org.apache.commons.text implementation.
  • You will need to download the Apache Commons Text library and add it to your project, and import org.apache.commons.text.StringEscapeUtils; in your code.
  • You will also need to download the Apache Commons Lang library and add it to your project. Otherwise you will get a java.lang.ClassNotFoundException: org.apache.commons.lang3.Range at runtime.
  • Any spaces in the input will be preserved by unescapeHtml4() in the output.
skomisa