Can't reproduce ANSI Encoding to Windows-1256 by C#

Question

I have some encoded data in an .mdb file that looks like this: Úæäí and ÚáÇä. I tried it with Notepad++: I created a new file with ANSI encoding, pasted the text into it, and then changed the encoding to Windows-1256. The result is عوني and علان, which is perfect, but I can't reproduce this in code (C#). Here is the code:

public string Decode(DataRow rw, string colName)
{
   Encoding srcEnc = Encoding.GetEncoding("from what ?");
   Encoding destEnc = Encoding.GetEncoding("1256"); // Arabic encoding
   byte[] srcVal = srcEnc.GetBytes(rw[colName].ToString());
   byte[] destVal = Encoding.Convert(srcEnc, destEnc, srcVal);
   return destEnc.GetString(destVal);
}

But what is the `rw[colName].GetType()`? – xanatos May 03 '15 at 12:18

Charles Mager · Accepted Answer · 2015-05-03T13:37:43.303

4

The problem is that you're converting between encodings. That isn't what you're trying to achieve: you just want to re-interpret the encoded text.

To do this, you need to get the bytes for your ANSI string and then decode them using the correct encoding.

So, leaving out the conversion:

var latin = Encoding.GetEncoding(1252);
var bytes = latin.GetBytes("Úæäí");

var arabic = Encoding.GetEncoding(1256);            
var result = arabic.GetString(bytes);

result is عوني
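For completeness, here is a self-contained sketch of the same re-interpretation wrapped in a helper. The names `MojibakeFixer` and `ReinterpretAsArabic` are made up for illustration. The `RegisterProvider` call is an assumption about the runtime: on .NET Core and .NET 5+ the Windows code pages are only available once `CodePagesEncodingProvider` has been registered, whereas on .NET Framework code pages 1252 and 1256 are built in and that line can be dropped.

```csharp
using System.Text;

public static class MojibakeFixer
{
    static MojibakeFixer()
    {
        // Assumption: running on .NET Core / .NET 5+, where code pages
        // 1252 and 1256 must be registered before GetEncoding can find them.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Hypothetical helper: recovers the raw bytes by encoding the
    // mis-decoded text as Windows-1252, then decodes those same bytes
    // as Windows-1256. No Encoding.Convert step is involved.
    public static string ReinterpretAsArabic(string misdecoded)
    {
        Encoding latin = Encoding.GetEncoding(1252);
        Encoding arabic = Encoding.GetEncoding(1256);
        byte[] bytes = latin.GetBytes(misdecoded);
        return arabic.GetString(bytes);
    }
}
```

In the question's `Decode` method this could be called as `MojibakeFixer.ReinterpretAsArabic(rw[colName].ToString())`, assuming the column really holds text that was decoded as 1252.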

A caveat, as Hans points out in the comments: Windows-1252 has 5 byte values that are unused (0x81, 0x8D, 0x8F, 0x90, and 0x9D). If these correspond to characters in Windows-1256 used in the original text, then your source data is corrupted as these characters will have been lost on the initial decoding using 1252. Ideally, you want to start with the original encoded source.
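One way to surface this kind of damage instead of silently producing wrong output is to request an encoder that throws on characters it cannot represent. The following is only a sketch: `StrictReinterpreter` and `TryReinterpret` are hypothetical names, and the `RegisterProvider` call again assumes .NET Core or .NET 5+. It catches characters such as U+FFFD that have no Windows-1252 byte, but a literal `?` left behind by an earlier lossy step still encodes fine, so a successful result is not proof that nothing was lost.

```csharp
using System.Text;

public static class StrictReinterpreter
{
    static StrictReinterpreter()
    {
        // Assumption: .NET Core / .NET 5+; not needed on .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Hypothetical helper: returns false when the input contains a
    // character that Windows-1252 cannot encode, rather than letting
    // the default fallback replace it with '?'.
    public static bool TryReinterpret(string misdecoded, out string result)
    {
        Encoding strictLatin = Encoding.GetEncoding(
            1252,
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        Encoding arabic = Encoding.GetEncoding(1256);

        try
        {
            result = arabic.GetString(strictLatin.GetBytes(misdecoded));
            return true;
        }
        catch (EncoderFallbackException)
        {
            // The text cannot be mapped back to single 1252 bytes.
            result = string.Empty;
            return false;
        }
    }
}
```

A caller could log or skip rows for which `TryReinterpret` returns false and go back to the original source for those values.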

edited May 03 '15 at 13:37

answered May 03 '15 at 12:38


You need to explain that this is a *lossy* conversion: there are several byte values that are not valid in 1252 (0x81, 0x8d, 0x8f, etc.) but *are* valid in 1256. The inevitable outcome is a corrupted string with question marks. The only real fix is to get the encoding correct up front. – Hans Passant May 03 '15 at 13:01
@HansPassant - thanks, I hadn't considered! I've edited to incorporate your comments. – Charles Mager May 03 '15 at 13:38
To solve the problem, use ISO-8859-1 instead of 1252. ISO-8859-1 maps each of the 256 byte values to the character with the same code. – xanatos May 03 '15 at 17:16
