I have a string that I receive from an email in C#, and I want to display it correctly. I know the text is coming in as Encoding.Default. According to this answer I have to convert it to UTF-8, so I tried this code:

byte[] bytes = Encoding.Default.GetBytes(input);
string strResult = Encoding.UTF8.GetString(bytes);

It works, but it can't convert some characters. In the webmail interface, the original string is:

باسلام همکار گرامی شماره 53018 مربوط به دبیرخانه ستاد می باشد لطفا اصلاح فرمائید 

When I convert the string with the code above, I get this result:

باس �?ا�? �?�?�?ار گرا�?�? �?ا�?�? ش�?ار�? 53018  �?رب�?ط ب�? د ب�?رخا�?�? ستاد �?�? باشد �?طفا اص�?اح فر�?ائ�?د�? 

Any ideas?
Update: the content of the `input` variable:

Ø§ÙØ²Ø§ÙØ´ تسÙÙÙØ§Øª \r\n \r\n\r\n باس ÙØ§Ù ÙÙÙØ§Ø± گراÙÙ ÙØ§ÙÙ Ø´ÙØ§Ø±Ù
Sirwan Afifi
  • Please try UTF8Encoding.UTF8.GetString(bytes) – Issac Johnson Aug 12 '15 at 05:43
  • So, you already have a string that looks valid? Is it presented badly in the generated email? Is this an issue with your email not using utf8 encoding? – sisve Aug 12 '15 at 05:43
  • Your 1st line generates a byte array encoding input in the default scheme (probably not UTF8). Your 2nd line tries to decode that byte array with another scheme (UTF8), so it becomes meaningless characters. If what you want is the byte array encoding input in UTF8, then you should use `Encoding.UTF8.GetBytes` – cshu Aug 12 '15 at 06:16
  • The "default encoding" is an ansi encoding which can't handle the characters provided. `Encoding.Default` is _always_ a bad choice. – sisve Aug 12 '15 at 06:24
  • Can you tell your sender to change the encoding on his side? – Matthias Aug 12 '15 at 07:25
  • @Matthias No I can't – Sirwan Afifi Aug 12 '15 at 07:26
  • So it is possible to display the e-mail correctly on your side (computer) or not? – Matthias Aug 12 '15 at 07:27
  • @Matthias As I mentioned the e-mail correctly displays on the web mail interface. – Sirwan Afifi Aug 12 '15 at 07:29
  • Because if it is possible in the web interface, the problem is only that you need to know how to interpret the source format (probably it is then not default). – Matthias Aug 12 '15 at 07:29
  • I recommend viewing the full e-mail headers inside the webmail interface to figure that out. – Matthias Aug 12 '15 at 07:30
  • .NET strings are *always* Unicode - that's just how the type is defined. You have to convert the input data *from* the original encoding *to* Unicode. Where is the code that actually reads `input`? The code you posted will *always* mangle non-ASCII (essentially non-English) strings – Panagiotis Kanavos Aug 12 '15 at 07:31
  • Please post the contents of the `input` variable. If it appears properly, you don't have to convert anything. If it contains weird characters, it was read as ASCII with the wrong codepage. Question marks or boxes mean characters were lost during that conversion. – Panagiotis Kanavos Aug 12 '15 at 07:37
  • @PanagiotisKanavos I've updated my question – Sirwan Afifi Aug 12 '15 at 07:43
  • So it's an ASCII string with a certain encoding. That encoding will be shown in the `Content-Type` header of the page and possibly as an HTML meta tag. You should use *that* encoding to read the original data and convert it to bytes, eg `var encoding=Encoding.GetEncoding(1256);var bytes=encoding....;` – Panagiotis Kanavos Aug 12 '15 at 07:46
  • Also, how *are* you reading the page? .NET's classes should make this conversion automatically if the HTTP Content-Type header is correct. Perhaps the page has the wrong header but correct meta tags? – Panagiotis Kanavos Aug 12 '15 at 07:47

1 Answer

Finally solved the problem. The UTF-8 code unit values had been stored as a sequence of 16-bit code units in a C# string, so each code unit should be within the range of a byte. First we copy those values into a byte array, and then we decode the resulting UTF-8 byte sequence into a normal UTF-16 string:

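// Each char holds one UTF-8 byte value (0-255), so copy it into a byte array.
byte[] utf8Bytes = new byte[utf8String.Length];
for (int i = 0; i < utf8String.Length; ++i) {
    utf8Bytes[i] = (byte)utf8String[i];
}
// Decode the recovered bytes as UTF-8 into a normal (UTF-16) string.
var result = Encoding.UTF8.GetString(utf8Bytes);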

So for this input:

Ø§ÙØ²Ø§ÙØ´ تسÙÙÙØ§Øª \r\n\r\n\r\n<p>Ø¨Ø§Ø³ÙØ§Ù ÙÙÙØ§Ø± گراÙÙ ÙØ§ÙÙ Ø´ÙØ§Ø±Ù&nbsp;53018 &nbsp;ÙØ±Ø¨ÙØ· ب٠د Ø¨ÙØ±Ø®Ø§Ù٠ستاد Ù٠باشد ÙØ·Ùا Ø§ØµÙØ§Ø­ ÙØ±ÙØ§Ø¦ÙØ¯\r\n\r\n

I get the correct result:

افزايش تسهيلات \r\n\r\n\r\n<p>باسلام همكار گرامي نامه شماره&nbsp;53018 &nbsp;مربوط به د بيرخانه ستاد مي باشد لطفا اصلاح فرمائيد\r\n\r\n \r\n\r\n

PS: To remove the extra line-break characters, I use this code:

result = result.Replace('\r', ' ').Replace('\n', ' ');
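
For reference, the byte-copy loop in this answer does the same thing as encoding the string with ISO-8859-1, which maps the characters U+0000 to U+00FF onto identical byte values. Below is a minimal sketch of that variant with an explicit range check added, since a plain `(byte)` cast silently truncates any character above 0xFF. The class name `MojibakeFix` and the method `RecoverUtf8` are hypothetical names chosen for illustration, and the sketch assumes the damaged text really was produced by reading UTF-8 bytes one byte per character. If it was decoded with a different code page (for example Windows-1252 or 1256), some characters may fall outside the byte range and this approach will throw rather than recover the text.

```csharp
using System;
using System.Text;

public static class MojibakeFix
{
    // Hypothetical helper (not from the original answer): recovers text whose
    // UTF-8 bytes were each stored as one char (0x00-0xFF) in a .NET string.
    public static string RecoverUtf8(string misdecoded)
    {
        foreach (char c in misdecoded)
        {
            // A char above 0xFF cannot represent a single byte; a plain (byte)
            // cast would silently truncate it, so fail loudly instead.
            if (c > 0xFF)
            {
                throw new ArgumentException(
                    $"Character U+{(int)c:X4} is outside the byte range.",
                    nameof(misdecoded));
            }
        }

        // ISO-8859-1 maps chars U+0000..U+00FF to the same byte values,
        // so this is equivalent to the per-character cast loop.
        byte[] utf8Bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(misdecoded);
        return Encoding.UTF8.GetString(utf8Bytes);
    }

    public static void Main()
    {
        // "salam 53018" written in Persian script, using escapes so the
        // source file encoding does not matter.
        string original = "\u0633\u0644\u0627\u0645 53018";

        // Simulate the damage: UTF-8 bytes read as if they were ISO-8859-1.
        byte[] bytes = Encoding.UTF8.GetBytes(original);
        string misdecoded = Encoding.GetEncoding("ISO-8859-1").GetString(bytes);

        string recovered = RecoverUtf8(misdecoded);
        Console.WriteLine(recovered == original); // round-trip check
    }
}
```

This only repairs the string after the fact. As the comments point out, decoding the original bytes with the charset declared in the message's Content-Type header avoids the damage in the first place.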
Sirwan Afifi