What's the difference between a string in the source code and a string read from a file?

Question

There is a file named "dd.txt" on my disk; its content is \u5730\u7406

Now, when I run this program:

public static void main(String[] args) throws IOException {
    FileInputStream fis=new FileInputStream("d:\\dd.txt");
    ByteArrayOutputStream baos=new ByteArrayOutputStream();
    byte[] buffer=new byte[fis.available()];
    while ((fis.read(buffer))!=-1) {
        baos.write(buffer);
    }
    String s1="\u5730\u7406";
    String s2=baos.toString("utf-8");
    System.out.println("s1:"+s1+"\n"+"s2:"+s2);
}

I get two different results:

s1:地理
s2:\u5730\u7406

Can you tell me why? And how can I read that file and get the same result as s1, in Chinese?

Because the _compiler_ does the replacement before compiling anything. — Boris the Spider, Jul 14 '15 at 07:40
Side note: fis.available() tells you how many bytes can be read without blocking. It does not tell you the length of the input (file). — Harald K, Jul 14 '15 at 08:34
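
Following up on the side note above, here is a minimal sketch of reading a stream to the end without relying on available(). The class name ReadFully and the helper readAll are hypothetical names used only for illustration, and an in-memory stream stands in for the FileInputStream from the question. The point is that the buffer size is fixed and only the bytes actually returned by read are written.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class ReadFully {
    // Hypothetical helper: copies everything from the stream into a byte array.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];      // fixed size, independent of available()
        int n;
        while ((n = in.read(buffer)) != -1) {
            baos.write(buffer, 0, n);        // write only the n bytes that were read
        }
        return baos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the file content: twelve plain ASCII characters.
        byte[] fileContent = "\\u5730\\u7406".getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new ByteArrayInputStream(fileContent)) {
            String s2 = new String(readAll(in), StandardCharsets.UTF_8);
            System.out.println(s2.length()); // prints 12
        }
    }
}
```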

Kayaman · Answer 1 · 2015-07-14T09:49:27.430

30

When you write \u5730 in Java source code, the compiler interprets it as a single Unicode character (a Unicode escape). When you write the same text to a file, it's just six ordinary characters, because nothing is interpreting it. Is there a reason why you're not writing 地理 directly to the file?
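
A small sketch of that difference, assuming nothing beyond the standard library. The class and method names (EscapeVsText, compiled, plainText) are made up for illustration. The doubled backslash in the second string reproduces what ends up in memory when the six characters are read from a file.

```java
public class EscapeVsText {
    // The compiler translates this escape, so the string holds one char.
    static String compiled() {
        return "\u5730";
    }

    // The doubled backslash keeps the escape as plain text: six separate chars.
    static String plainText() {
        return "\\u5730";
    }

    public static void main(String[] args) {
        System.out.println(compiled().length());   // prints 1
        System.out.println(plainText().length());  // prints 6
        System.out.println(Integer.toHexString(compiled().charAt(0))); // prints 5730
    }
}
```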

If you wish to read a file containing these Unicode escapes, you'll need to parse them yourself: throw away the \u and convert the four hex digits that follow into a character. If you control the creation of the file, it's a lot easier to write the real characters with a suitable encoding (e.g. UTF-8) in the first place; under normal circumstances you should never come across files containing these escaped sequences.
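
As a rough sketch of the "write proper UTF-8 in the first place" route: the class name Utf8RoundTrip, the helper roundTrip, and the temporary file are assumptions for illustration, not the asker's setup. The text is written as UTF-8 bytes and decoded with the same charset, so the characters come back unchanged.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8RoundTrip {
    // Hypothetical helper: writes text to a temp file as UTF-8 and reads it back.
    static String roundTrip(String text) throws IOException {
        Path file = Files.createTempFile("dd", ".txt");
        try {
            Files.write(file, text.getBytes(StandardCharsets.UTF_8));
            return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(file);  // clean up the temp file
        }
    }

    public static void main(String[] args) throws IOException {
        String s1 = "\u5730\u7406";      // two chars after compilation
        String s2 = roundTrip(s1);       // the same two chars, read from the file
        System.out.println(s1.equals(s2));
    }
}
```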

edited Jul 14 '15 at 09:49

answered Jul 14 '15 at 07:32

Kayaman


I am just curious about it and want to know why. Thank you! – Paul Wang Jul 14 '15 at 09:30

@PaulWang if this answered your question, consider [accepting it](https://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work). This not only gives you a little bit of rep, but also does two other things - 1) indicates to the community that this question has been answered and 2) indicates to future readers which answer solved your problem. – Boris the Spider Jul 14 '15 at 09:32

Glorfindel · Answer 2 · 2015-07-14T09:43:46.730

6

In your Java code, the \uxxxx sequences are interpreted as Unicode escapes, so they are shown as Chinese characters. This happens only because the compiler translates them; nothing does that for text read from a file.

To obtain the same result, you have to do some parsing yourself:

String[] hexCodes = s2.split("\\\\u");
for (String hexCode : hexCodes) {
    if (hexCode.length() == 0)
        continue;
    int intValue = Integer.parseInt(hexCode, 16);
    System.out.print((char)intValue);
}

(Note that this only works if every character is in escaped form, i.e. \uxxxx.)

edited Jul 14 '15 at 09:43

answered Jul 14 '15 at 07:37

Glorfindel


1. The string used to split should be "\\\\u". 2. The first element of the resulting array is empty after the split. – Paul Wang Jul 14 '15 at 09:41

score 2 · Answer 3 · 2017-12-31T21:34:07.473

2

Try this:

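import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Matches a backslash, the letter u, and exactly four hex digits (captured as group 1).
static final Pattern UNICODE_ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");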

static String decodeUnicodeEscape(String s) {
    StringBuilder sb = new StringBuilder();
    int start = 0;
    Matcher m = UNICODE_ESCAPE.matcher(s);
    while (m.find()) {
        sb.append(s.substring(start, m.start()));
        sb.append((char)Integer.parseInt(m.group(1), 16));
        start = m.end();
    }
    sb.append(s.substring(start));
    return sb.toString();
}

public static void main(String[] args) throws IOException {
    // your code ....
    String s1="\u5730\u7406";
    String s2= decodeUnicodeEscape(baos.toString("utf-8"));
    System.out.println("s1:"+s1+"\n"+"s2:"+s2);
}

edited Dec 31 '17 at 21:34

answered Jul 14 '15 at 08:11

Note that this will only support Unicode characters that fit into a single char. For the rest of them, try this: `sb.append(new String(Character.toChars(Integer.parseInt(m.group(1), 16))))`. [More details](https://stackoverflow.com/questions/5585919/creating-unicode-character-from-its-number/16034658#16034658) – eis Dec 30 '17 at 18:21
@eis yesterday, My code also works for surrogate pairs. – Dec 31 '17 at 21:38
@saka1029 with your code, anything above Character.MAX_VALUE (0xFFFF = 65535) would fail. Unicode code points go up to U+10FFFF. – eis Jan 01 '18 at 15:11
@eis yesterday, I assumed that "\uxxxx" represents a Unicode escape sequence. So "𩸽" (U+29E3D) is encoded as "\uD867\uDE3D", not "\u29E3D". – Jan 03 '18 at 09:31
ah, now I understood. I think you are correct in that. – eis Jan 03 '18 at 10:51
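
The surrogate-pair point settled in these comments can be checked with a small sketch. It re-implements the regex decoder from the answer under the hypothetical names SurrogateCheck and decode, and assumes the file stores a supplementary character as two four-digit escapes (the UTF-16 code units). Appending the two chars one after the other is enough, because a Java String is itself a sequence of UTF-16 code units.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SurrogateCheck {
    static final Pattern UNICODE_ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Replaces every four-digit escape with the char it names; other text is copied as-is.
    static String decode(String s) {
        StringBuilder sb = new StringBuilder();
        int start = 0;
        Matcher m = UNICODE_ESCAPE.matcher(s);
        while (m.find()) {
            sb.append(s, start, m.start());
            sb.append((char) Integer.parseInt(m.group(1), 16));
            start = m.end();
        }
        sb.append(s, start, s.length());
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two escaped UTF-16 code units, as they would appear in the file.
        String decoded = decode("\\uD867\\uDE3D");
        System.out.println(decoded.length());                            // prints 2
        System.out.println(decoded.codePointCount(0, decoded.length())); // prints 1
        System.out.println(Integer.toHexString(decoded.codePointAt(0))); // prints 29e3d
    }
}
```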
