17

There is a file named "dd.txt" on my disk. Its content is \u5730\u7406

Now, when I run this program:

public static void main(String[] args) throws IOException {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    // try-with-resources closes the stream even if reading fails
    try (FileInputStream fis = new FileInputStream("d:\\dd.txt")) {
        byte[] buffer = new byte[4096];
        int n;
        // copy only the bytes actually read on each pass
        while ((n = fis.read(buffer)) != -1) {
            baos.write(buffer, 0, n);
        }
    }
    String s1 = "\u5730\u7406";
    String s2 = baos.toString("utf-8");
    System.out.println("s1:" + s1 + "\n" + "s2:" + s2);
}

I get a different result:

s1:地理
s2:\u5730\u7406

Can you tell me why? And how can I read that file and get the same result as s1, in Chinese?

user253751
Paul Wang

3 Answers

30

When you write \u5730 in Java code, the compiler interprets it as a single Unicode character (a Unicode escape). When you write the same text to a file, it's just 6 regular characters, because nothing interprets it. Is there a reason why you're not writing 地理 directly to the file?
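To make the difference concrete, here is a minimal sketch (the class name EscapeVsText is made up for illustration). It compares a string whose escapes the compiler translates with a string holding the same plain characters the file contains:

```java
// Hypothetical demo class; the name is only for illustration.
class EscapeVsText {
    // The compiler turns each escape into one char, so this holds 2 chars.
    static final String COMPILED = "\u5730\u7406";
    // Doubling the backslash keeps the text as written: 12 chars, like the file.
    static final String RAW = "\\u5730\\u7406";

    public static void main(String[] args) {
        System.out.println(COMPILED.length());    // 2
        System.out.println(RAW.length());         // 12
        System.out.println(COMPILED.equals(RAW)); // false
    }
}
```

The second string is what the program in the question ends up with after reading the file, which is why it prints the backslashes.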

If you wish to read a file containing Unicode escapes, you'll need to parse them yourself: throw away the \u and convert the four hex digits to a character. If you control the creation of the file, it's a lot easier to write the actual characters with a suitable encoding (e.g. UTF-8) in the first place. Under normal circumstances you should not come across files containing these escaped sequences.
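As a rough sketch of the "write real characters in UTF-8" route, assuming Java 8 or later: the class name Utf8RoundTrip and the temporary file are my own stand-ins, not something from the question. The text is written as UTF-8 bytes and read back with the same charset:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class; a temp file stands in for the dd.txt file.
class Utf8RoundTrip {
    // Writes the text as UTF-8 bytes, then decodes the file with the same charset.
    static String roundTrip(String text) {
        try {
            Path file = Files.createTempFile("dd", ".txt");
            try {
                Files.write(file, text.getBytes(StandardCharsets.UTF_8));
                return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file); // clean up the temp file
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String s1 = "\u5730\u7406";
        String s2 = roundTrip(s1);
        System.out.println(s1.equals(s2)); // true
    }
}
```

The important part is that the writer and the reader agree on the charset; no escape parsing is needed at all.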

Kayaman
  • I am just curious about it and want to know why. Thank you! – Paul Wang Jul 14 '15 at 09:30
  • 10
    @PaulWang if this answered your question, consider [accepting it](https://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work). This not only gives you a little bit of rep, but also does two other things - 1) indicates to the community that this question has been answered and 2) indicates to future readers which answer solved your problem. – Boris the Spider Jul 14 '15 at 09:32
6

In your Java code, the \uxxxx sequences are interpreted as Unicode escapes, so they are shown as Chinese characters. This happens only because the Java compiler translates them when it reads the source file; nothing does that for the contents of a text file.

To obtain the same result, you have to do some parsing yourself:

String[] hexCodes = s2.split("\\\\u");
for (String hexCode : hexCodes) {
    if (hexCode.length() == 0)
        continue;
    int intValue = Integer.parseInt(hexCode, 16);
    System.out.print((char)intValue);
}

(Note that this only works if every character in the input is written in Unicode escape form, i.e. \uxxxx.)

Glorfindel
  • 1. The string used to split should be "\\\\u". 2. The first element of the resulting array is empty after the string is split. – Paul Wang Jul 14 '15 at 09:41
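For anyone wondering about the empty first element mentioned in the comment, here is a small sketch (SplitDemo and hexCodes are names I made up) showing what the split produces for the file's contents:

```java
import java.util.Arrays;

// Hypothetical demo class; the names are only for illustration.
class SplitDemo {
    // Splits on a literal backslash followed by the letter u.
    static String[] hexCodes(String s) {
        return s.split("\\\\u");
    }

    public static void main(String[] args) {
        // The input starts with a separator, so the first element is empty.
        System.out.println(Arrays.toString(hexCodes("\\u5730\\u7406"))); // [, 5730, 7406]
    }
}
```

That leading empty string is why the loop in the answer skips elements of length 0 before calling Integer.parseInt.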
2

Try this:

// Needs: import java.util.regex.Matcher; import java.util.regex.Pattern;
static final Pattern UNICODE_ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

static String decodeUnicodeEscape(String s) {
    StringBuilder sb = new StringBuilder();
    int start = 0;
    Matcher m = UNICODE_ESCAPE.matcher(s);
    while (m.find()) {
        sb.append(s.substring(start, m.start()));
        sb.append((char)Integer.parseInt(m.group(1), 16));
        start = m.end();
    }
    sb.append(s.substring(start));
    return sb.toString();
}

public static void main(String[] args) throws IOException {
    // your code ....
    String s1="\u5730\u7406";
    String s2= decodeUnicodeEscape(baos.toString("utf-8"));
    System.out.println("s1:"+s1+"\n"+"s2:"+s2);
}
  • note that this will only support unicode characters that fit into single char. For the rest of them, try this: `sb.append(new String(Character.toChars(Integer.parseInt(m.group(1), 16))))`. [More details](https://stackoverflow.com/questions/5585919/creating-unicode-character-from-its-number/16034658#16034658) – eis Dec 30 '17 at 18:21
  • @eis yesterday, My code also works for surrogate pairs. –  Dec 31 '17 at 21:38
  • @saka1029 with your code, anything above Character.MAX_VALUE (0xFFFF = 65535) would fail. Unicode code points go up to U+10FFFF. – eis Jan 01 '18 at 15:11
  • @eis yesterday, I assumed that "\uxxxx" represents a Unicode escape sequence. So "𩸽" is encoded as "\uD867\uDE3D", not "\u29E3D". –  Jan 03 '18 at 09:31
  • ah, now I understood. I think you are correct in that. – eis Jan 03 '18 at 10:51
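The point settled in these comments can be checked with a short sketch (SurrogateDemo and fromHex are hypothetical names). Each four-digit group is cast to one char, as in the answer; when two consecutive groups are a surrogate pair, the resulting two chars together form one code point above 0xFFFF:

```java
// Hypothetical demo class; the names are only for illustration.
class SurrogateDemo {
    // Converts each four-digit hex group to one char and appends it.
    static String fromHex(String... hexGroups) {
        StringBuilder sb = new StringBuilder();
        for (String hex : hexGroups) {
            sb.append((char) Integer.parseInt(hex, 16));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = fromHex("D867", "DE3D");
        System.out.println(s.length());                            // 2
        System.out.println(s.codePointCount(0, s.length()));       // 1
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 29e3d
        // Same string as building it directly from the code point.
        System.out.println(s.equals(new String(Character.toChars(0x29E3D)))); // true
    }
}
```

So the char cast is fine as long as the input uses four-digit escapes with surrogate pairs; Character.toChars would only be needed if the input carried a code point above 0xFFFF in a single number.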