Convert string from Windows-1256 to UTF-8

Question

I want to convert strings from a database that uses Windows-1256 encoding to UTF-8. The data is in Persian.

I used the code below, but I get question marks: ????.

string text= "راوي"; // should be "راوی"
byte[] encoded = Encoding.GetEncoding(1256).GetBytes(text);
string result= Encoding.UTF8.GetString(encoded);

How can I do this conversion?

show the code you use to retrieve a 1256-encoded value from the database. your code sample won't work as intended because c# string variables are utf8 and forcing them into 1256 garbles it. — Cee McSharpface, Apr 19 '18 at 19:18
@dlatikay `.NET uses the UTF-16 encoding` => https://learn.microsoft.com/en-us/dotnet/standard/base-types/character-encoding — Eser, Apr 19 '18 at 19:27
right. I still think we need to see the database code to help here. some database providers have character encoding options in the connection properties, would be good to know which RDBMS and driver we're talking about too. — Cee McSharpface, Apr 19 '18 at 19:32
You're decoding Win-1256 bytes as if they were UTF-8. That's plain wrong, and probably the opposite of what you want to do. Strings in .NET are string objects; their internal encoding is irrelevant. Encoding only matters if you want to convert them to bytes for some reason. — Nyerguds, Apr 19 '18 at 19:42
I wrote a C program that converts a file from windows-1256 to UTF-8, byte by byte. You may be able to write the same program with any language. check it here: https://github.com/mutawa/win2utf — Ahmad, Mar 17 '20 at 11:12

Remy Lebeau · Accepted Answer · 2018-04-19T19:55:36.900

The code presented takes a native .NET string (which uses UTF-16 encoding), encodes it to Windows-1256, and then misinterprets the resulting bytes as UTF-8, which they are not. So of course the UTF-8 decoding produces ? for the non-ASCII characters, since they were never encoded as UTF-8 to begin with.

The code is not doing what the question is asking for.
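To make the failure concrete, here is a minimal sketch of what the question's code does. The class and method names (MisdecodeDemo, DecodeAsUtf8) are made up for illustration, and the RegisterProvider line assumes .NET Core / .NET 5 or later, where code page 1256 is not available until the code-pages provider is registered; on .NET Framework that line can most likely be dropped.

```csharp
using System;
using System.Text;

static class MisdecodeDemo
{
    // Hypothetical helper that repeats the question's two steps:
    // encode to Windows-1256, then (wrongly) decode those bytes as UTF-8.
    public static string DecodeAsUtf8(string text)
    {
        // Assumption: .NET Core / .NET 5+, where legacy code pages must be registered first.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        byte[] encoded = Encoding.GetEncoding(1256).GetBytes(text);
        return Encoding.UTF8.GetString(encoded);
    }

    static void Main()
    {
        // Same text as in the question, written with escapes so the
        // source file's own encoding cannot interfere.
        string text = "\u0631\u0627\u0648\u064A";
        string result = DecodeAsUtf8(text);

        // The single-byte Windows-1256 values are not valid UTF-8 sequences,
        // so the decoder substitutes U+FFFD replacement characters.
        Console.WriteLine(result == text);            // False
        Console.WriteLine(result.Contains("\uFFFD")); // True
    }
}
```

Depending on where the result is displayed, the replacement characters may show up as ? rather than as the U+FFFD glyph.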

The correct way to convert Windows-1256 (or any other encoding) to UTF-8 is to first take the source data as-is and decode it to UTF-16 using the original encoding, and then encode that result to UTF-8, e.g.:

byte[] Win1256Data = ...;
string s = Encoding.GetEncoding(1256).GetString(Win1256Data);
byte[] Utf8Data = Encoding.UTF8.GetBytes(s);

Alternatively, the Encoding class has a Convert() method to handle the intermediate conversion for you:

byte[] Win1256Data = ...;
byte[] Utf8Data = Encoding.Convert(Encoding.GetEncoding(1256), Encoding.UTF8, Win1256Data);
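A self-contained round trip may help when checking this outside the database code. The sketch below is only illustrative: EncodingDemo and Win1256ToUtf8 are made-up names, the sample bytes are produced in memory rather than read from a database, and the RegisterProvider call assumes .NET Core / .NET 5 or later (on .NET Framework it is most likely unnecessary).

```csharp
using System;
using System.Text;

static class EncodingDemo
{
    // Hypothetical helper wrapping the Encoding.Convert call shown above.
    public static byte[] Win1256ToUtf8(byte[] win1256Data)
    {
        // Assumption: .NET Core / .NET 5+, where code page 1256 is only
        // available after the code-pages provider has been registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        return Encoding.Convert(Encoding.GetEncoding(1256), Encoding.UTF8, win1256Data);
    }

    static void Main()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Stand-in for bytes read from the database: the question's text,
        // encoded as Windows-1256 (one byte per Arabic letter).
        string original = "\u0631\u0627\u0648\u064A";
        byte[] win1256Data = Encoding.GetEncoding(1256).GetBytes(original);

        byte[] utf8Data = Win1256ToUtf8(win1256Data);

        Console.WriteLine(win1256Data.Length);                            // 4
        Console.WriteLine(utf8Data.Length);                               // 8
        Console.WriteLine(Encoding.UTF8.GetString(utf8Data) == original); // True
    }
}
```

If the database driver already returns a .NET string rather than raw bytes, no conversion of this kind is needed; Encoding.UTF8.GetBytes on that string yields the UTF-8 bytes directly.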
