C# : Japanese characters with unicode encoding

Question

The code is meant to write Unicode escape sequences to a file as Japanese characters:

   String s = "\u30a2\u30c3\u30d7\u30ed\u30fc\u30c9\u3059\u308b\u30d5\u30a1\u30a4\u30eb\u304c\u6307\u5b9a\u3055\u308c\u3066\u3044\u307e\u305b\u3093";
   var Bytes = Encoding.Unicode.GetBytes(s);      
   string  key = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Unicode, Encoding.UTF8, Bytes));

`key` is what I want to write to the file, but it has the value \u30a2\u30c3\u30d7\u30ed\u30fc\u30c9\u3059\u308b\u30d5\u30a1\u30a4\u30eb\u304c\u6307\u5b9a\u3055\u308c\u3066\u3044\u307e\u305b\u3093. Any ideas what's wrong?

Your question and your example code are totally inexplicable. The original string is a C# string (which is stored internally as UTF-16). Then you change it to a sequence of bytes representing a UTF-16 string. Then you `Convert` it to a sequence of bytes representing a UTF-8 string. Then you read those bytes back into a C# string which is identical to the string you had in the first place. What, exactly, are you trying to accomplish here? — mqp, Aug 30 '11 at 18:01
A string in .NET is always UTF-16. So the notion of UTF-8 string doesn't make any sense. You can convert a string into a UTF-8 encoded sequence of bytes : `var bytes = Encoding.UTF8.GetBytes(s);`. Is this what you need? — Darin Dimitrov, Aug 30 '11 at 18:01
He already had the actual characters in the first place; he typed them into the string. — mqp, Aug 30 '11 at 18:04
@BalusC, what *actual characters*? He already has them in the original string. — Darin Dimitrov, Aug 30 '11 at 18:04
@Darin: I don't do C#, but in Java, if you write this to stdout or any kind of outputstream using UTF-8 or any other Unicode encoding, it'll just show/contain the actual characters instead of unicode escape sequences. Perhaps he's trying to get it inside the String like `String s = "アップロードするファイルが指定されていません";` for some unobvious reason? His problem is likely a presentational matter. — BalusC, Aug 30 '11 at 18:07
@BalusC: He could write that in the source file, if he liked; C# source can be Unicode. But whether he uses the escape sequences or not in his source file, it should still show the characters if he writes it to stdout or looks at it in the debugger, just like in Java. — mqp, Aug 30 '11 at 18:13
@remo: You really need to be more clear about what you're trying to accomplish *after all*. Are you trying to display them in some console or UI and getting `????????????` or the like (in which case the charset of the console/UI has to be configured)? Or did you get literally the same string back (in which case the \ has been escaped)? You should elaborate *that* problem in more detail. — BalusC, Aug 30 '11 at 18:13
The original string that I have there is actually read from a file; it is a Unicode-escaped representation of Japanese characters. I wanted to see them as, or convert them to, the equivalent Japanese characters. That's what I need to accomplish here, and I thought I needed the UTF-8 representation of the Japanese characters to see them. Let me know if what I am looking for is wrong. — remo, Aug 30 '11 at 18:14
You want to see them *where*? You should elaborate that in detail. After all, this is definitely just a presentational matter. Your question is just badly asked. "Convert Unicode to UTF-8" makes no utter sense. — BalusC, Aug 30 '11 at 18:15
One might find this older post useful. http://stackoverflow.com/questions/1615559/converting-unicode-strings-to-escaped-ascii-string/1615860#1615860 — Dustin Kingen, Aug 30 '11 at 18:20

score 3 · Accepted Answer · answered Aug 30 '11 at 18:13

What's wrong is that a string (key) has no notion of the bytes used to store it. In this case, your string is:

String:

アップロードするファイルが指定されていません

This is exactly what

"\u30a2\u30c3\u30d7\u30ed\u30fc\u30c9\u3059\u308b\u30d5\u30a1\u30a4\u30eb\u304c\u6307\u5b9a\u3055\u308c\u3066\u3044\u307e\u305b\u3093"

means. The escape '\u30a2' may look like two bytes, but it simply denotes the single character 'ア'.
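That holds for escapes typed into C# source, where the compiler does the translation. If the same text is read from a file at run time, as the asker's later comment suggests, the backslash and the hex digits are ordinary characters and nothing decodes them automatically. Here is a minimal sketch of decoding them by hand, assuming the input contains only simple `\uXXXX` sequences with four hex digits. The names `EscapeDecoder` and `DecodeUnicodeEscapes` are made up for illustration:

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

static class EscapeDecoder
{
    // Replaces each literal backslash-u-XXXX sequence with the UTF-16
    // code unit it names. Text that does not match is copied unchanged.
    public static string DecodeUnicodeEscapes(string text)
    {
        return Regex.Replace(
            text,
            @"\\u([0-9A-Fa-f]{4})",
            m => ((char)int.Parse(m.Groups[1].Value, NumberStyles.HexNumber)).ToString());
    }
}
```

For example, `EscapeDecoder.DecodeUnicodeEscapes(@"\u30a2\u30c3\u30d7")` should return `アップ`. By contrast, the verbatim literal `@"\u30a2"` is six characters long, while `"\u30a2"` is one. The decoded string can then be passed to `File.WriteAllText` with whichever encoding the output file needs.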

If you save the string to a file, the bytes written depend on the encoding you choose:

UTF-8 bytes

File.WriteAllText("temp.txt", "アップロードするファイルが指定されていません", Encoding.UTF8);

The contents will be (in bytes)

 E3 82 A2 E3 83 83 E3 83 97 E3 83 AD E3 83 BC E3 83 89 E3 81 99 E3 82 8B E3 83 
 95 E3 82 A1 E3 82 A4 E3 83 AB E3 81 8C E6 8C 87 E5 AE 9A E3 81 95 E3 82 8C E3 
 81 A6 E3 81 84 E3 81 BE E3 81 9B E3 82 93

UTF-16 bytes

File.WriteAllText("temp.txt", "アップロードするファイルが指定されていません", Encoding.Unicode);

The contents will be (in bytes)

 A2 30 C3 30 D7 30 ED 30 FC 30 C9 30 59 30 8B 30 D5 30 A1 30 A4 30 EB 30 4C 30 
 07 63 9A 5B 55 30 8C 30 66 30 44 30 7E 30 5B 30 93 30
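The byte listings above can be reproduced without writing a file. A small sketch, where `ByteDump.Hex` is a hypothetical helper name, encodes a string with a given `Encoding` and formats the resulting bytes as hex:

```csharp
using System;
using System.Text;

static class ByteDump
{
    // Encodes the string with the given encoding and returns the bytes
    // as space-separated uppercase hex pairs.
    public static string Hex(Encoding encoding, string text)
    {
        byte[] bytes = encoding.GetBytes(text);
        return BitConverter.ToString(bytes).Replace("-", " ");
    }
}
```

`ByteDump.Hex(Encoding.UTF8, "ア")` gives `E3 82 A2` and `ByteDump.Hex(Encoding.Unicode, "ア")` gives `A2 30`, matching the first character in each listing. Note that `GetBytes` returns only the encoded characters. As far as I know, `File.WriteAllText` with `Encoding.UTF8` or `Encoding.Unicode` also puts a byte order mark (`EF BB BF` or `FF FE`) at the start of the file, which the listings leave out.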

This is what I was looking for; I wanted to know how you could decode those Japanese characters here. — remo, Aug 30 '11 at 18:18
Thanks very much. How could I open a UTF-8 text file like the one you show in the `The contents will be (in bytes)` part? — Luke, Jun 21 '19 at 01:21
@Luke: System.IO.StreamReader's constructor takes an Encoding parameter, and it defaults to UTF-8, so if you read a UTF-8 file with StreamReader it should work as expected. Otherwise, `File.ReadAllText` can also take an encoding parameter, so `File.ReadAllText("my_utf8_file.txt", Encoding.UTF8)` or `File.ReadAllText("my_utf16_file.txt", Encoding.Unicode)` should work. — Jimmy, Jun 23 '19 at 15:37
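
A minimal round-trip sketch of the reading side described in the last comment, assuming a temporary file is acceptable for the demonstration. `RoundTrip.WriteThenRead` is an illustrative name, not a framework API:

```csharp
using System.IO;
using System.Text;

static class RoundTrip
{
    // Writes the text to a temporary file with the given encoding,
    // reads it back with the same encoding, and returns what was read.
    public static string WriteThenRead(string text, Encoding encoding)
    {
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllText(path, text, encoding);
            return File.ReadAllText(path, encoding);
        }
        finally
        {
            // Clean up the temporary file whether or not the read succeeded.
            File.Delete(path);
        }
    }
}
```

Calling `RoundTrip.WriteThenRead("アップロード", Encoding.UTF8)`, or the same with `Encoding.Unicode`, should return the original string in both cases, since the string is the same regardless of how the file stores it.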

score 0 · Answer 2 · 2011-08-30T18:16:10.480

One doesn't "convert" Unicode to UTF-8 :-/

Unicode, besides being the parent for the entire set of specifications, can be thought of as "simply" defining code-points/characters and the rules of interaction. The UTF-8 encoding is the specific set of rules to map a sequence of Unicode code-points into a sequence of octets (8-bit bytes).

Try this in LINQPad:

String s = "\u30a2\u30c3\u30d7\u30ed";
s.Dump();     // original string
var bytes = Encoding.UTF8.GetBytes(s);      
bytes.Dump(); // see UTF-8 encoded byte sequence
string key = Encoding.UTF8.GetString(bytes);
key.Dump();   // contents restored

UTF-8 exists only in the bytes, never in the string itself.

Happy coding.

edited Aug 30 '11 at 18:16

answered Aug 30 '11 at 18:05

In C#, `Encoding.Unicode` means UTF-16, so it's possible that when the OP says "Unicode" he means UTF-16 in particular. – mqp Aug 30 '11 at 18:07
@mquander Very true. I agree with your post comment ;-) – Aug 30 '11 at 18:08
