Convert Latin 1 encoded UTF8 to Unicode

Question

I am trying to convert a database that appears to be encoded in UTF-8 into Windows-1251 encoding (don't ask, but I need to do this). All of the Russian characters in the db show up as Ð°Ð±Ð²Ð³Ð´Ð. When I pull them out of the db into strings in my C# app, I still see Ð°Ð±Ð²Ð³Ð´Ð. No matter how I try to interpret this string as UTF-8, it seems to be treated as a Latin-1 single-byte string, and my text does not show up as Russian. What I basically need to do is convert this Latin-1-looking, UTF-8 encoded string into Unicode so that I can later convert it to 1251, but I have not been able to do this successfully. Does anyone have any ideas?

Hey. Perhaps if you show us an extract of the code you're using to retrieve the strings from the database, this might help. Also what sort of database is it? MS SQL? — CraftyFella, Sep 18 '09 at 08:06
This question is incoherent. What on earth is "latin 1 encoded UTF 8"? — Mark Amery, May 22 '17 at 10:14

bobince · Answer 1 · 2013-07-03T13:24:18.787

Encoding.UTF8.GetString(Encoding.GetEncoding("iso-8859-1").GetBytes(s))

Now you have a normal Unicode string containing Cyrillic.
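
A minimal sketch of that one-liner as a reusable helper, assuming the string really is UTF-8 text that was decoded as ISO-8859-1 somewhere along the way. The names `MojibakeRepair` and `FromLatin1` are made up for illustration.

```csharp
using System.Text;

public static class MojibakeRepair
{
    // Hypothetical helper: undo a "UTF-8 bytes read as Latin-1" mix-up.
    public static string FromLatin1(string s)
    {
        // Latin-1 maps U+0000..U+00FF one-to-one onto bytes 0x00..0xFF,
        // so encoding the garbled string recovers the original byte sequence.
        byte[] raw = Encoding.GetEncoding("iso-8859-1").GetBytes(s);

        // Those bytes were UTF-8 all along; decode them as such.
        return Encoding.UTF8.GetString(raw);
    }
}
```

For example, `MojibakeRepair.FromLatin1("Ð°Ð±Ð²")` returns `"абв"`. The sample in the question ends in a lone `Ð`, which is only the first half of a two-byte UTF-8 sequence, so that last character should come back as the replacement character U+FFFD rather than a letter.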

Note that it is possible that your ‘Latin-1’ misencoded string might actually be a ‘Windows codepage 1252’ misencoded string; I can't tell from the given example as it doesn't use any of the characters that are different between the two encodings. If this is the case use GetEncoding(1252) instead.
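
The two encodings differ only for bytes 0x80 to 0x9F. As an illustration, the Cyrillic letter р (U+0440) is D1 80 in UTF-8; read as code page 1252, byte 0x80 becomes `€`, so the garbled form is `Ñ€`, and Latin-1 cannot turn that `€` back into 0x80. Below is a sketch of the 1252 variant. It assumes .NET Core 3.0 or later, where Windows code pages have to be registered through `CodePagesEncodingProvider` before `GetEncoding(1252)` works (on .NET Framework they are available without this step). `Cp1252Repair` and `FromCp1252` are hypothetical names.

```csharp
using System.Text;

public static class Cp1252Repair
{
    static Cp1252Repair()
    {
        // Assumption: running on .NET Core / .NET 5+, where code page 1252
        // is not available until this provider is registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Hypothetical helper: undo a "UTF-8 bytes read as Windows-1252" mix-up.
    public static string FromCp1252(string s)
    {
        byte[] raw = Encoding.GetEncoding(1252).GetBytes(s);
        return Encoding.UTF8.GetString(raw);
    }
}
```

With this helper, `Cp1252Repair.FromCp1252("Ñ€")` returns `"р"`.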

Also this is assuming that it's the contents of the database at fault. If the database is supposed to be storing UTF-8 strings but you're pulling them out as if they were Latin-1 (or codepage 1252 due to that being the system codepage) then really you need to reconfigure your data access layer to set the right encoding. If you're using SQL Server, better to start using NVARCHAR.

You sir, are pure gold with that "better to start using NVARCHAR", saved me tons of time searching for how to encode/decode strings or alter database collation. Live long and prosper!!! — Zahari Kitanov, Jan 25 '20 at 23:53

score 1 · Answer 2 · 2009-09-16T14:15:57.833


I am using SQL Server, and all columns are nvarchar. The data was imported with a MySQL dump from a db that was latin1, not utf8, so all the Unicode strings are simply Latin-1 encoded. In any case, I figured it out, and it's very similar to what you suggested. Here's what I did to convert the Latin-1 encoded UTF-8 into 1251.

 // reinterpret the Latin-1 string as proper UTF-8
 str = Encoding.UTF8.GetString(Encoding.GetEncoding("iso-8859-1").GetBytes(str));

 //convert from utf8 to 1251
 str = Encoding.GetEncoding(1251).GetString(Encoding.Convert(Encoding.UTF8, Encoding.GetEncoding(1251), Encoding.UTF8.GetBytes(str)));

edited Sep 16 '09 at 14:15

answered Sep 16 '09 at 13:53


I'm not sure what the point of the second line is. Encode as UTF-8, transcode to cp1251 (why not just GetBytes on the 1251 Encoding in the first place?) then get a Unicode string back from those bytes? All this will do is filter out any characters not present in 1251 from your Unicode string. int version: http://msdn.microsoft.com/en-us/library/wzsz3bk3.aspx – bobince Sep 16 '09 at 23:09
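
A sketch of the simpler route the comment points at: once the string is a normal Unicode string, `GetBytes` on the 1251 encoding yields the Windows-1251 bytes directly. The names `Cp1251Export`, `ToCp1251` and `ToCp1251Strict` are hypothetical, and the provider registration assumes .NET Core 3.0 or later (it is not needed on .NET Framework).

```csharp
using System.Text;

public static class Cp1251Export
{
    static Cp1251Export()
    {
        // Assumption: running on .NET Core / .NET 5+, where code page 1251
        // is not available until this provider is registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Encode a Unicode string straight to Windows-1251 bytes.
    // Characters that 1251 cannot represent go through the encoder's
    // default fallback instead of raising an error.
    public static byte[] ToCp1251(string s)
    {
        return Encoding.GetEncoding(1251).GetBytes(s);
    }

    // Same conversion, but throws EncoderFallbackException when a
    // character has no Windows-1251 representation.
    public static byte[] ToCp1251Strict(string s)
    {
        Encoding strict = Encoding.GetEncoding(
            1251,
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        return strict.GetBytes(s);
    }
}
```

For instance, `Cp1251Export.ToCp1251("абв")` yields the three bytes E0 E1 E2. The strict variant may be preferable when silently losing characters during the export would be a problem.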
