Encode and decode a multilingual string in C#

Question

I want to encode and then decode a string that contains multilingual characters, where the languages, the length, and the character positions (e.g., Chinese characters at indexes 8-10) are unknown.

Is it even possible to have a "universal" encoder? Or some algorithm that knows how to decode this?

Searching the web turned up only solutions that require knowing where the special characters are and which language they belong to, and I can't even know the language itself.

Any ideas?

EDIT: Example: a string that consists of several languages, such as:

"Hello {CHINESE} my {LATIN} is rusted"

which consists of English, Chinese, and Latin.

But when I do

var test = ASCIIEncoding.ASCII.GetBytes(someStr);

and then

ASCIIEncoding.ASCII.GetString(test)

the "special characters" (i.e., non-English characters) are converted to question marks.

What do you mean by "encode"? What context makes some characters "special"? No character is any more special than any other other than in a given context (e.g. `漢` is special in URLs but not in HTML). — Jon Hanna, Mar 01 '17 at 14:54
Can you provide some examples? Right now it is unclear what is your concrete problem and what is your goal. — Andrey Korneyev, Mar 01 '17 at 14:54
UTF16 (and UTF8) are perfectly good encodings that support all the characters that you'll use :-) — xanatos, Mar 01 '17 at 14:57
Ok... So don't use `ASCIIEncoding`? It is a relic of a bygone era... Use `Encoding.UTF8.GetBytes`. and `Encoding.UTF8.GetString` — xanatos, Mar 01 '17 at 15:05

score 3 · Accepted Answer · answered Mar 01 '17 at 15:05

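Don't use ASCII encoding: it covers only 128 characters (U+0000 through U+007F), so it cannot represent a string that mixes languages. In .NET, any character outside that range is replaced with `?` when encoding, which is why the round trip loses data.

Use a Unicode encoding instead: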

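// UnicodeEncoding.Unicode is the inherited Encoding.Unicode property, i.e. UTF-16 little-endian.
var test = UnicodeEncoding.Unicode.GetBytes(someStr);   // string -> bytes
var test1 = UnicodeEncoding.Unicode.GetString(test);    // bytes -> string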
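
To make the difference concrete, here is a minimal sketch that round-trips the same string through ASCII, UTF-8, and UTF-16. The `RoundTripDemo` class, the `RoundTrip` helper, and the sample text are made up for illustration; only the `Encoding` members come from .NET.

```csharp
using System;
using System.Text;

public static class RoundTripDemo
{
    // Hypothetical helper: encodes text to bytes with the given encoding,
    // then decodes those bytes back into a string.
    public static string RoundTrip(Encoding encoding, string text)
    {
        byte[] bytes = encoding.GetBytes(text);
        return encoding.GetString(bytes);
    }

    public static void Main()
    {
        // Illustrative sample mixing English, Chinese, and a Latin word with a macron.
        string sample = "Hello 你好 my Latīna is rusted";

        // ASCII cannot represent the non-ASCII characters, so the result differs.
        Console.WriteLine(RoundTrip(Encoding.ASCII, sample) == sample);   // False

        // UTF-8 and UTF-16 can represent every character in the sample.
        Console.WriteLine(RoundTrip(Encoding.UTF8, sample) == sample);    // True
        Console.WriteLine(RoundTrip(Encoding.Unicode, sample) == sample); // True
    }
}
```

Neither Unicode encoding needs to know which languages the string contains or where the non-English characters sit; the only requirement is to decode with the same encoding that was used to encode.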

Andrey Korneyev

Which one is better: `UnicodeEncoding` or `Encoding.UTF8`? – Tomer Something Mar 01 '17 at 15:09

@TomerSomething If your text consists mostly of Latin characters, UTF-8 can be the better choice, since ASCII characters take one byte each in UTF-8 versus two in UTF-16. `UnicodeEncoding.Unicode` is in fact UTF-16. – Andrey Korneyev Mar 01 '17 at 15:12
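
The size trade-off between the two encodings can be checked with `Encoding.GetByteCount`. This is a small sketch only; the `ByteCountDemo` class and the sample strings are assumptions chosen for illustration, not values from the question.

```csharp
using System;
using System.Text;

public static class ByteCountDemo
{
    public static void Main()
    {
        string english = "Hello"; // 5 ASCII characters (illustrative sample)
        string chinese = "你好";   // 2 CJK characters (illustrative sample)

        // ASCII-range characters: 1 byte each in UTF-8, 2 bytes each in UTF-16.
        Console.WriteLine(Encoding.UTF8.GetByteCount(english));    // 5
        Console.WriteLine(Encoding.Unicode.GetByteCount(english)); // 10

        // These CJK characters: 3 bytes each in UTF-8, 2 bytes each in UTF-16.
        Console.WriteLine(Encoding.UTF8.GetByteCount(chinese));    // 6
        Console.WriteLine(Encoding.Unicode.GetByteCount(chinese)); // 4
    }
}
```

Either encoding round-trips the text correctly; the choice mainly affects how many bytes are stored or transmitted.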
