
The intention of the code is to write Unicode escape sequences to a file as Japanese characters.

   String s = "\u30a2\u30c3\u30d7\u30ed\u30fc\u30c9\u3059\u308b\u30d5\u30a1\u30a4\u30eb\u304c\u6307\u5b9a\u3055\u308c\u3066\u3044\u307e\u305b\u3093";
   var Bytes = Encoding.Unicode.GetBytes(s);      
   string  key = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Unicode, Encoding.UTF8, Bytes));

`key` is what I want to print to the file, but it has the value \u30a2\u30c3\u30d7\u30ed\u30fc\u30c9\u3059\u308b\u30d5\u30a1\u30a4\u30eb\u304c\u6307\u5b9a\u3055\u308c\u3066\u3044\u307e\u305b\u3093. Any ideas what's wrong?

remo
  • Your question and your example code are totally inexplicable. The original string is a C# string (which is stored internally as UTF-16). Then you change it to a sequence of bytes representing a UTF-16 string. Then you `Convert` it to a sequence of bytes representing a UTF-8 string. Then you read those bytes back into a C# string which is identical to the string you had in the first place. What, exactly, are you trying to accomplish here? – mqp Aug 30 '11 at 18:01
  • A string in .NET is always UTF-16. So the notion of UTF-8 string doesn't make any sense. You can convert a string into a UTF-8 encoded sequence of bytes : `var bytes = Encoding.UTF8.GetBytes(s);`. Is this what you need? – Darin Dimitrov Aug 30 '11 at 18:01
  • UTF-8 is a character encoding for Unicode. – Bala R Aug 30 '11 at 18:02
  • Perhaps he's trying to get the actual characters out of it? – BalusC Aug 30 '11 at 18:03
  • He already had the actual characters in the first place; he typed them into the string. – mqp Aug 30 '11 at 18:04
  • what .NET version are you using ? – Yahia Aug 30 '11 at 18:04
  • @BalusC, what *actual characters*? He already has them in the original string. – Darin Dimitrov Aug 30 '11 at 18:04
  • @Darin: I don't do C#, but in Java, if you write this to stdout or any kind of outputstream using UTF-8 or any other Unicode encoding, it'll just show/contain the actual characters instead of unicode escape sequences. Perhaps he's trying to get it inside the String like `String s = "アップロードするファイルが指定されていません";` for some unobvious reason? His problem is likely a presentational matter. – BalusC Aug 30 '11 at 18:07
  • @BalusC: He could write that in the source file, if he liked; C# source can be Unicode. But whether he uses the escape sequences or not in his source file, it should still show the characters if he writes it to stdout or looks at it in the debugger, just like in Java. – mqp Aug 30 '11 at 18:13
  • @remo: You really need to be more clear about what you're trying to accomplish *after all*. Are you trying to show them up in some console or UI and you got `????????????` or like (thus, charset of console/UI has to be configured)? Or did you get literally the same string (thus, \ has been escaped)? You should elaborate *that* problem in more detail. – BalusC Aug 30 '11 at 18:13
  • The original string that I have there is actually read from a file; it is a Unicode escape representation of Japanese characters. I wanted to see them as, or convert them to, the equivalent Japanese characters. That's what I need to accomplish here, and I thought I needed the UTF-8 representation of the Japanese characters to see them. Let me know if what I am looking for is wrong. – remo Aug 30 '11 at 18:14
  • You want to see them *where*? You should elaborate that in detail. After all, this is definitely just a presentational matter. Your question is just badly asked. "Convert Unicode to UTF-8" makes no utter sense. – BalusC Aug 30 '11 at 18:15
  • I need to print them to a file – remo Aug 30 '11 at 18:17
  • One might find this older post useful. http://stackoverflow.com/questions/1615559/converting-unicode-strings-to-escaped-ascii-string/1615860#1615860 – Dustin Kingen Aug 30 '11 at 18:20
  • @all i changed the question to make better sense i believe – remo Aug 30 '11 at 18:26

2 Answers


What's wrong is that a string (key) has no notion of the bytes used to store it. In this case, your string is:

String:

アップロードするファイルが指定されていません

This is exactly what

"\u30a2\u30c3\u30d7\u30ed\u30fc\u30c9\u3059\u308b\u30d5\u30a1\u30a4\u30eb\u304c\u6307\u5b9a\u3055\u308c\u3066\u3044\u307e\u305b\u3093" 

means. The expression '\u30a2' looks like two bytes, but it is just an escape sequence for the single character 'ア'.
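
As a rough illustration of that point, the sketch below compares an escaped literal with a verbatim literal. The class and member names (`EscapeDemo`, `Escaped`, `Literal`) are made up for this example and are not part of the question's code.

```csharp
using System;

public static class EscapeDemo
{
    // Regular literal: the compiler turns the escape \u30a2 into the single character ア.
    public const string Escaped = "\u30a2";

    // Verbatim literal: no escape processing, so this is six characters: \ u 3 0 a 2.
    public const string Literal = @"\u30a2";

    public static void Main()
    {
        Console.WriteLine(Escaped.Length);              // 1
        Console.WriteLine(Escaped == "ア");             // True
        Console.WriteLine((int)Escaped[0] == 0x30A2);   // True
        Console.WriteLine(Literal.Length);              // 6
    }
}
```

If the string in the program really does contain a backslash followed by `u30a2`, it behaves like `Literal` here, and no encoding conversion will change that.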

If you save to a UTF-8 file, the bytes written will be:

UTF-8 bytes

File.WriteAllText("temp.txt", "アップロードするファイルが指定されていません", Encoding.UTF8);

The contents will be (in bytes, following the EF BB BF byte-order mark that `Encoding.UTF8` emits)

 E3 82 A2 E3 83 83 E3 83 97 E3 83 AD E3 83 BC E3 83 89 E3 81 99 E3 82 8B E3 83 
 95 E3 82 A1 E3 82 A4 E3 83 AB E3 81 8C E6 8C 87 E5 AE 9A E3 81 95 E3 82 8C E3 
 81 A6 E3 81 84 E3 81 BE E3 81 9B E3 82 93

UTF-16 bytes

File.WriteAllText("temp.txt", "アップロードするファイルが指定されていません", Encoding.Unicode);

The contents will be (in bytes, following the FF FE byte-order mark that `Encoding.Unicode` emits)

 A2 30 C3 30 D7 30 ED 30 FC 30 C9 30 59 30 8B 30 D5 30 A1 30 A4 30 EB 30 4C 30 
 07 63 9A 5B 55 30 8C 30 66 30 44 30 7E 30 5B 30 93 30
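
The byte listings above can be checked without writing a file. This is a minimal sketch that uses only the first three characters of the string; `ByteDump` and `Hex` are names invented for the example. `GetPreamble()` returns the byte-order mark that the file-writing APIs put in front of the text when given these encodings.

```csharp
using System;
using System.Text;

public static class ByteDump
{
    // Formats a byte array as space-separated hex, e.g. "E3 82 A2".
    public static string Hex(byte[] bytes) => BitConverter.ToString(bytes).Replace("-", " ");

    public static void Main()
    {
        string s = "\u30a2\u30c3\u30d7"; // アップ

        Console.WriteLine(Hex(Encoding.UTF8.GetBytes(s)));      // E3 82 A2 E3 83 83 E3 83 97
        Console.WriteLine(Hex(Encoding.Unicode.GetBytes(s)));   // A2 30 C3 30 D7 30

        Console.WriteLine(Hex(Encoding.UTF8.GetPreamble()));    // EF BB BF
        Console.WriteLine(Hex(Encoding.Unicode.GetPreamble())); // FF FE
    }
}
```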
Jimmy
  • This is what I was looking for, I wanted to know how you could decode those japanese characters here.. – remo Aug 30 '11 at 18:18
  • Thanks very much, How could I open a UTF_8 txt file as you show in `The contents will be (in bytes)` part? – Luke Jun 21 '19 at 01:21
  • @Luke: System.IO.StreamReader's constructor takes an Encoding parameter, and it defaults to UTF-8, so if you read a UTF-8 file with StreamReader it should work as expected. Otherwise, `File.ReadAllText` also can take an encoding parameter, so `File.ReadAllText("my_utf8_file.txt", Encoding.UTF8)` or `File.ReadAllText("my_utf16_file.txt", Encoding.Unicode)` should work – Jimmy Jun 23 '19 at 15:37

One doesn't "convert" Unicode to UTF-8 :-/

Unicode, besides being the parent for the entire set of specifications, can be thought of as "simply" defining code-points/characters and the rules of interaction. The UTF-8 encoding is the specific set of rules to map a sequence of Unicode code-points into a sequence of octets (8-bit bytes).

Try this in LINQPad:

String s = "\u30a2\u30c3\u30d7\u30ed";
s.Dump();     // original string
var bytes = Encoding.UTF8.GetBytes(s);      
bytes.Dump(); // see UTF-8 encoded byte sequence
string key = Encoding.UTF8.GetString(bytes);
key.Dump();   // contents restored

UTF-8 exists only in the bytes; the string itself has no encoding.
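
If, as the asker's comment suggests, the text read from the file contains the literal characters `\u30a2` (a backslash, a `u`, and four hex digits) rather than the Japanese characters themselves, then the escapes have to be decoded before writing. The following is only a sketch under that assumption: `EscapeDecoder`, `Decode`, and the output file name `decoded.txt` are hypothetical, and it handles only four-digit `\uXXXX` escapes.

```csharp
using System;
using System.Globalization;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

public static class EscapeDecoder
{
    // Replaces each literal \uXXXX sequence with the UTF-16 code unit it names.
    public static string Decode(string text) =>
        Regex.Replace(text, @"\\u([0-9a-fA-F]{4})",
            m => ((char)int.Parse(m.Groups[1].Value, NumberStyles.HexNumber)).ToString());

    public static void Main()
    {
        // Verbatim literal: stands in for text read from a file that holds the escapes as plain characters.
        string raw = @"\u30a2\u30c3\u30d7\u30ed\u30fc\u30c9";

        string decoded = Decode(raw);
        Console.WriteLine(decoded == "アップロード"); // True

        // Write the decoded characters out as UTF-8 (hypothetical file name).
        File.WriteAllText("decoded.txt", decoded, Encoding.UTF8);
    }
}
```

Text without any `\uXXXX` sequences passes through `Decode` unchanged, so it is harmless to apply to input that is already decoded.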

Happy coding.

  • In C#, `Encoding.Unicode` means UTF-16, so it's possible that when the OP says "Unicode" he means UTF-16 in particular. – mqp Aug 30 '11 at 18:07
  • @mquander Very true. I agree with your post comment ;-) –  Aug 30 '11 at 18:08