
I send a string from my HTML page to my Java HttpServlet. In the request, I get ASCII codes that represent Chinese characters:

"& #21487;& #20197;& #21578;& #35785;& #25105;" (without the spaces)

How can I transform this string into Unicode?

HTML code:

<html>
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    <title>Find information</title>
    <link rel="stylesheet" type="text/css" href="layout.css">
</head>
<body>

<form id="lookupform" name="lookupform" action="LookupServlet" method="post" accept-charset="UTF-8">
    <table id="lookuptable" align="center">
        <tr>
            <td><label for="lookupstring">Question:</label></td>
            <td><textarea cols="30" rows="2" name="lookupstring" id="lookupstring"></textarea></td>
        </tr>
    </table>
    <input type="submit" name="Look up" id="lookup" value="Look up"/>
</form>

</body>
</html>

Java code:

request.setCharacterEncoding("UTF-8");
javax.servlet.http.HttpSession session = request.getSession();
LoginResult lr = (LoginResult) session.getAttribute("loginResult");
String[] question = request.getParameterValues("lookupstring");

If I print question[0] then I get this value: "& #21487;& #20197;& #21578;& #35785;& #25105;"

Peter Mortensen
Rob Hufschmitt

2 Answers

5

There is no such thing as ASCII codes that display Chinese characters. ASCII does not represent Chinese characters.

If you already have a Java string, it already holds all of its characters (US, Latin, Chinese) internally as Unicode. You can then encode that Java string into bytes using a Unicode encoding such as UTF-8 or UTF-16:

String s = "可以告诉我"; (EDIT: This line won't display correctly on systems not having fonts for Chinese characters)

String s = "\u53ef\u4ee5\u544a\u8bc9\u6211";
byte[] utfString = s.getBytes("UTF-8"); // getBytes returns a byte array; this overload throws the checked UnsupportedEncodingException
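To make the difference between a string and its encoded bytes concrete, here is a small sketch. The class and method names (`EncodingDemo`, `toUtf8`, `toUtf16`) are made up for illustration; only `String.getBytes(Charset)` and `StandardCharsets` (Java 7+) come from the JDK.

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Encodes the string's characters as UTF-8 bytes.
    public static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Encodes the string's characters as big-endian UTF-16 bytes (no byte order mark).
    public static byte[] toUtf16(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        String s = "\u53ef\u4ee5\u544a\u8bc9\u6211"; // five Chinese characters

        byte[] utf8 = toUtf8(s);
        byte[] utf16 = toUtf16(s);

        System.out.println(s.length());   // 5 chars in the string
        System.out.println(utf8.length);  // 15: each of these characters takes 3 bytes in UTF-8
        System.out.println(utf16.length); // 10: each of these characters takes 2 bytes in UTF-16

        // Decoding the bytes with the same charset gives the original string back.
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(s)); // true
    }
}
```

The same string yields different byte sequences depending on the encoding, which is why the encoding has to be named whenever bytes enter or leave the program.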

Now that I look at your updated question, you might be looking for the StringEscapeUtils class from Apache Commons Text. It will unescape your HTML entities into a Java string (the method is named unescapeHtml4 in Commons Text and Commons Lang 3; the older Commons Lang 2 called it unescapeHtml):

String s = StringEscapeUtils.unescapeHtml4("& #21487;& #20197;& #21578;& #35785;& #25105;"); // without spaces
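If adding a dependency is not an option, numeric references like these can also be decoded with the JDK alone. The following is only a minimal sketch: the class name `NumericEntityDecoder` and its `decode` method are hypothetical, and it handles numeric references only (decimal and hexadecimal), not named entities such as `&nbsp;`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumericEntityDecoder {
    // Group 1 captures hex digits (&#x53EF;), group 2 captures decimal digits (&#21487;).
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));");

    /** Replaces each numeric character reference with the character it names. */
    public static String decode(String input) {
        Matcher m = NUMERIC_REF.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start()); // copy the text before the reference
            try {
                int codePoint = m.group(1) != null
                        ? Integer.parseInt(m.group(1), 16)
                        : Integer.parseInt(m.group(2), 10);
                if (Character.isValidCodePoint(codePoint)) {
                    out.appendCodePoint(codePoint);
                } else {
                    out.append(m.group()); // not a Unicode code point: leave as-is
                }
            } catch (NumberFormatException e) {
                out.append(m.group()); // number too large to parse: leave as-is
            }
            last = m.end();
        }
        out.append(input, last, input.length()); // copy the remaining tail
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "&#21487;&#20197;&#21578;&#35785;&#25105;";
        String decoded = decode(raw);
        System.out.println(decoded.equals("\u53ef\u4ee5\u544a\u8bc9\u6211")); // prints true
    }
}
```

In the servlet this would be called as `decode(question[0])`. For anything beyond numeric references, a maintained library is the safer choice.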
Francis Bartkowiak
Pablo Santa Cruz
  • No, but the string displayed looks like this: "& #21487;& #20197;& #21578;& #35785;& #25105;" – Rob Hufschmitt Dec 24 '10 at 12:08
  • @Rob: That is probably your PAGE or RESPONSE encoding. Show us the code you are using to "print" that page into the HTTP RESPONSE, and the encoding you are using for the page and the response. – Pablo Santa Cruz Dec 24 '10 at 12:11
  • Never, ever, put non-escaped non-ASCII characters in a *.java* source file. The Java specs do not specify a source encoding, and hence experience has proved that you **SHALL** run into issues when mixing OSes, IDEs, batch/shell scripts, etc. In addition, on my system (Chrome on an otherwise stock Debian Linux) the Chinese characters in your answer all appear as "empty rectangles" because my system doesn't have any Chinese font installed. – SyntaxT3rr0r Dec 24 '10 at 14:30
0

A Java String contains Unicode characters. Any decoding already took place when the string was constructed.
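As an illustration, the numbers in the question's string are simply the decimal Unicode code points of the characters, which a Java String can report directly. This is a small sketch; `CodePointDemo` and its helper methods are hypothetical names, while `String.codePoints()` (Java 8+) and the `String(int[], int, int)` constructor are standard JDK APIs.

```java
import java.util.Arrays;

public class CodePointDemo {
    // Returns the Unicode code points that make up the string.
    public static int[] codePointsOf(String s) {
        return s.codePoints().toArray();
    }

    // Builds a string from a sequence of Unicode code points.
    public static String fromCodePoints(int... codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        String s = "\u53ef\u4ee5\u544a\u8bc9\u6211";

        // Prints [21487, 20197, 21578, 35785, 25105]
        System.out.println(Arrays.toString(codePointsOf(s)));

        // Going the other way rebuilds the same string.
        System.out.println(fromCodePoints(21487, 20197, 21578, 35785, 25105).equals(s)); // true
    }
}
```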

Thorbjørn Ravn Andersen