Generic solution needed for decoding Cyrillic string encoded in UTF-8 in C#

Question

I am getting ÐÐ¸ÑÐ¸Ð»Ð» ÐÐ°ÑÐ°Ð½Ð½Ð¸Ðº from a C++ component and I need to decode it. The string is always UTF-8 encoded. After some research, I found the following way to decode it:

string text = Encoding.UTF8.GetString(
    Encoding.GetEncoding("iso-8859-1").GetBytes("ÐÐ¸ÑÐ¸Ð»Ð» ÐÐ°ÑÐ°Ð½Ð½Ð¸Ðº"));

But doesn't this hard-code "iso-8859-1"? What if characters other than Cyrillic come up? I would like a generic method for decoding a UTF-8 string.

Thanks in advance.

Related: [How can I detect the encoding codepage of a text-file](http://stackoverflow.com/questions/90838/how-can-i-detect-the-encoding-codepage-of-a-text-file) — sshow, Apr 18 '13 at 12:38
how do you know what encoding that string would be representing? — Daniel A. White, Apr 18 '13 at 12:38
I reverse engineered how it was encoded and arrived at the code above. — user2295072, Apr 18 '13 at 12:40
Doesn't that C++ component give you a byte array/char*? How did you end up with that `ÐÐ¸ÑÐ¸Ð...` string? Show your interop code. — CodesInChaos, Apr 18 '13 at 12:52
pretty close relative. If you don't *know* the exact encoding of a couple of bytes, you need to *guess*. That's how other applications like browsers or notepad++ do it. — Corak, Apr 18 '13 at 12:58
@Corak I think the OP knows it's UTF-8, but for some reason the interop code attempted to decode it with iso-8859-1 or ANSI(the system legacy encoding). So it looks like a completely different question to me. — CodesInChaos, Apr 18 '13 at 13:16
@CodesInChaos what's the exact string class the C++ component gives, and what does your interop code look like? Just because it gives you a "string" doesn't mean the bad conversion isn't happening on your end. — Random832, Apr 19 '13 at 12:25

Daniel A.A. Pelsmaeker · Accepted Answer · 2014-03-17T17:15:04.047

3

When you type text, the computer sees only bytes. In this case, when you type Cyrillic characters into your C++ program, the computer converts each character to its corresponding UTF-8 byte sequence.

string typedByUser = "Привет мир!";
byte[] input = Encoding.UTF8.GetBytes(typedByUser);

Then your C++ program comes along, looks at the bytes and thinks it is ISO-8859-1 encoded.

string cppString = Encoding.GetEncoding("iso-8859-1").GetString(input);
// ÐÑÐ¸Ð²ÐµÑ Ð¼Ð¸Ñ!
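To make the byte-level view concrete, here is a minimal sketch that prints the UTF-8 bytes of a string and the code points those same bytes turn into when read as ISO-8859-1. The class and method names (`ByteView`, `Utf8Hex`, `Latin1CodePoints`) are made up for illustration.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
public static class ByteView
{
    // Hex dump of the UTF-8 bytes of a string, e.g. "D0-9F" for "П".
    public static string Utf8Hex(string text) =>
        BitConverter.ToString(Encoding.UTF8.GetBytes(text));

    // Code points produced when the given bytes are decoded as ISO-8859-1.
    // Every byte becomes exactly one character with the same numeric value.
    public static string Latin1CodePoints(byte[] bytes)
    {
        string decoded = Encoding.GetEncoding("iso-8859-1").GetString(bytes);
        var sb = new StringBuilder();
        foreach (char c in decoded)
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.Append("U+").Append(((int)c).ToString("X4"));
        }
        return sb.ToString();
    }
}
```

For "П" the UTF-8 bytes are D0 9F. Read as ISO-8859-1 they become U+00D0 (Ð) and U+009F, which is an invisible control character. That is why garbled text of this kind often looks as if it has fewer characters than bytes, and why copying it through a text editor or web page can lose data.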

There is not much you can do about that. You then receive the wrongly decoded string and have to assume that it is UTF-8 data that was incorrectly decoded as ISO-8859-1. In this case the assumption turns out to be correct, but you cannot determine this with certainty from the string alone.

byte[] decoded = Encoding.GetEncoding("iso-8859-1").GetBytes(cppString);
string text = Encoding.UTF8.GetString(decoded);
// Привет мир!

Note that ISO-8859-1 is the ISO West-European encoding, and has nothing to do with the fact that the original input was Cyrillic. For example, if the input was Japanese UTF-8 encoded, your C++ program would still interpret it as ISO-8859-1:

string typedByUser = "こんにちは、世界！";
byte[] input = Encoding.UTF8.GetBytes(typedByUser);
string cppString = Encoding.GetEncoding("iso-8859-1").GetString(input);
// ããã«ã¡ã¯ãä¸çï¼
byte[] decoded = Encoding.GetEncoding("iso-8859-1").GetBytes(cppString);
string text = Encoding.UTF8.GetString(decoded);
// こんにちは、世界！

The C++ program will always interpret the input as ISO-8859-1, regardless of whether it is Cyrillic, Japanese or plain English. So that assumption is always correct.
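Because the fix does not depend on the language of the text, it can be wrapped in a single helper. The following is a minimal sketch, assuming the component really does decode every UTF-8 byte as ISO-8859-1; `MojibakeRepair` and `Repair` are hypothetical names, not part of any library.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
public static class MojibakeRepair
{
    private static readonly Encoding Latin1 = Encoding.GetEncoding("iso-8859-1");

    // Undoes "UTF-8 bytes decoded as ISO-8859-1".
    // Assumption: every char in the input came from exactly one original byte.
    public static string Repair(string garbled)
    {
        if (garbled == null) throw new ArgumentNullException(nameof(garbled));

        // Step 1: map each char back to the byte it was created from.
        byte[] originalBytes = Latin1.GetBytes(garbled);

        // Step 2: decode those bytes with the encoding they were written in.
        return Encoding.UTF8.GetString(originalBytes);
    }
}
```

The same call handles Cyrillic, Japanese and plain English input, since ISO-8859-1 is only used as a lossless byte-to-char mapping here and not as a statement about the language.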

However, you have an additional assumption that the original input is UTF-8 encoded. I'm not sure whether that is always correct. It may depend on the program, the input mechanism it uses and the default encoding used by the Operating System. For example, the C++ program made the assumption that the original input is ISO-8859-1 encoded, which was wrong.
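If the UTF-8 assumption might not always hold, one option is to decode strictly and treat a failure as a sign that the input was something else. This is a sketch under that assumption, with hypothetical names (`StrictMojibakeRepair`, `TryRepair`). A successful decode is only a strong hint, not proof, that the original bytes were UTF-8.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
public static class StrictMojibakeRepair
{
    private static readonly Encoding Latin1 = Encoding.GetEncoding("iso-8859-1");

    // UTF-8 decoder that throws on malformed input instead of inserting U+FFFD.
    private static readonly Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    public static bool TryRepair(string garbled, out string repaired)
    {
        repaired = null;
        if (garbled == null) return false;

        // A char above U+00FF cannot have come from a single ISO-8859-1 byte,
        // so the string was not produced by the assumed mis-decoding.
        foreach (char c in garbled)
        {
            if (c > '\u00FF') return false;
        }

        try
        {
            repaired = StrictUtf8.GetString(Latin1.GetBytes(garbled));
            return true;
        }
        catch (DecoderFallbackException)
        {
            // The recovered bytes are not well-formed UTF-8.
            return false;
        }
    }
}
```

A caller can fall back to using the string unchanged, or log the problem, when `TryRepair` returns false.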

By the way, character encodings have always been problematic. A great example is a letter from a French student to his Russian friend where the Cyrillic address was incorrectly written as ISO-8859-1 on the envelope, and decoded by the postal employees.

edited Mar 17 '14 at 17:15

answered Apr 18 '13 at 13:17

Daniel A.A. Pelsmaeker

How exactly does this work with iso-8859-1 instead of cp1252? The Japanese text contains dozens of bytes that are in the 128-159 range, and non-unicode-aware programs tend to use cp1252 as a default encoding. – Random832 Apr 18 '13 at 13:39
@Random832 The input in this case was UTF-8, _not_ ISO-8859-1. In fact, ISO-8859-1 has nothing to do with it. In UTF-8 you can write both Cyrillic and Japanese characters. If the input wasn't UTF-8, but KOI8-R or CP1252, then the OP's first step of decoding ISO-8859-1 is still correct, but the second step of encoding as UTF-8 would be wrong. – Daniel A.A. Pelsmaeker Apr 18 '13 at 13:41
> In fact, ISO-8859-1 has nothing to do with it. -- except for being the encoding that the byte string was supposedly misinterpreted as. The reason I am confused is that it seems far more likely to have been misinterpreted as CP1252, so using ISO-8859-1 to convert back to a byte array should cause errors. I am suggesting that the step of decoding ISO-8859-1 is _incorrect_ (and I can't understand how it works) and ought to be decoding CP1252 instead. – Random832 Apr 18 '13 at 13:47
@Random832 I give you some text as bytes (computers work with binary data), but I don't tell you how I encoded it (encoding X). You (the C++ program) _always assumes_ it is ISO-8859-1 encoded, whether it actually is or not. Then the OP comes along, takes the ISO-8859-1 encoded text, decodes it as ISO-8859-1 (to get the bytes back) and re-encodes it as whatever encoding it originally was (X). – Daniel A.A. Pelsmaeker Apr 18 '13 at 13:55
@Random832 Programs generally don't engage in a text encoding guessing game. They just use some default (constant, or system default) encoding. And since the Cyrillic text also uses characters above 127 and this didn't change how the C++ program interpreted the text, it is safe to assume it _always_ thinks it is ISO-8859-1 encoded text. – Daniel A.A. Pelsmaeker Apr 18 '13 at 13:56
@Random832: You and I have almost the same understanding. I shouldn't be using ISO-8859-1, but through reverse engineering I figured out that's the encoding used. Anyway, I would still need a generic way of decoding it. – user2295072 Apr 19 '13 at 05:22
@user2295072 What don't you understand? The C++ program you're using will always think it is ISO-8859-1 (or CP-1252) encoded, no matter if characters other than Cyrillic come up. So you will always have to decode it like that. – Daniel A.A. Pelsmaeker Apr 19 '13 at 08:39
The programmer at the C++ component has asked me not to use language specific decoding ISO-8859-1 in this case. I need to have some generic way to decode it. – user2295072 Apr 19 '13 at 08:54
@user2295072 Well apparently the C++ programmer doesn't know what he's talking about (which is also apparent from the hack you have to do to get the character encoding straight). ISO-8859-1 is _not_ language specific in this case. It is just an error on the C++ programmer's part. – Daniel A.A. Pelsmaeker Apr 19 '13 at 09:30
@user2295072 You should use Encoding.Default, since it's likely that the codepage the C++ component interprets the data as depends on the language of the installed Windows version (which has little to do with what the characters are). However, if you're on an East Asian language version of Windows it will probably break anyway, since these are stateful multibyte encodings rather than a simple mapping of bytes to characters. Really, the C++ developer should fix the app to convert from UTF-8 to UTF-16 properly and not do all this "ÐÐ¸ÑÐ¸Ð»Ð» ÐÐ°ÑÐ°Ð½Ð½Ð¸Ðº" stuff, or else just send it as a byte[]. – Random832 Apr 19 '13 at 12:22
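The last suggestion in the comments, passing raw bytes instead of an already-decoded string, could look roughly like the sketch below. The interop code was never shown in this thread, so this assumes a hypothetical case where the C++ component hands over a pointer to a null-terminated UTF-8 buffer. `NativeUtf8` and `FromPointer` are made-up names.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

// Hypothetical helper, for illustration only.
public static class NativeUtf8
{
    // Reads a null-terminated UTF-8 string from unmanaged memory.
    // Assumption: ptr points at a valid, null-terminated buffer.
    public static string FromPointer(IntPtr ptr)
    {
        if (ptr == IntPtr.Zero) return null;

        // Find the terminating zero byte.
        int length = 0;
        while (Marshal.ReadByte(ptr, length) != 0) length++;

        // Copy the raw bytes and decode them as UTF-8 directly,
        // so no intermediate ISO-8859-1 or ANSI decoding takes place.
        byte[] buffer = new byte[length];
        Marshal.Copy(ptr, buffer, 0, length);
        return Encoding.UTF8.GetString(buffer);
    }
}
```

With the bytes decoded once, in the right encoding, the ISO-8859-1 round trip is no longer needed.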

score 0 · Answer 2 · answered Apr 18 '13 at 12:56

A source of characters should only be transferred in one encoding. That means it is either iso-8859-1 or something else, but not both at the same time (so you might be wrong about the reverse-engineered Cyrillic in the first place).

Could you post the expected UTF-8 output of your input?

Michiel Cornille

Input is "ÐÐ¸ÑÐ¸Ð»Ð» ÐÐ°ÑÐ°Ð½Ð½Ð¸Ðº" and Output is "Кирилл Баранник" – user2295072 Apr 19 '13 at 05:23
