
I have been having issues reading a file that contains a mix of Arabic and Western text. I read the file into a TextBox as follows:

tbx1.Text = File.ReadAllText(fileName.Text, Encoding.UTF8);

No matter what value I tried instead of Encoding.UTF8, I got garbled characters displayed in place of the Arabic. The Western text displayed fine.

I thought it might have been an issue with the way the TextBox was defined, but on startup I write some mixed Western/Arabic text to the TextBox and it displays fine:

tbx1.Text = "Start السلا عليكم" + Environment.NewLine + "Here";

Then I opened Notepad, copied the above text into it and saved the file, at which point Notepad's save dialog asked which encoding to use.

[Screenshot: Notepad Save As dialog showing the encoding options]

I then presented the saved file to my code and it displayed all the content correctly.

I examined the file and found three extra bytes at the beginning (not visible in Notepad):

[Screenshot: hex view of the file showing the three extra bytes at the start]

I subsequently found through research that these three bytes (EF BB BF) are the UTF-8 byte order mark (BOM), and their presence is what enables File.ReadAllText(fileName.Text, Encoding.UTF8) to read and display the data as desired.
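For reference, a check for that preamble could look something like this (a minimal sketch; the helper name is just a placeholder of mine):

// Minimal sketch: does the file start with the UTF-8 BOM (EF BB BF)?
// Requires: using System.IO; using System.Linq; using System.Text;
static bool StartsWithUtf8Bom(string path)
{
    byte[] preamble = Encoding.UTF8.GetPreamble();   // { 0xEF, 0xBB, 0xBF }
    byte[] head = new byte[preamble.Length];

    using (var fs = File.OpenRead(path))
    {
        int read = fs.Read(head, 0, head.Length);
        return read == preamble.Length && head.SequenceEqual(preamble);
    }
}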

What puzzles me is that specifying Encoding.UTF8 should already take care of this.

The only workaround I can think of is to code up a step that adds these bytes to a copy of the file and then process that copy, but this seems rather long-winded. I'm just wondering if there is a better way to do this, or why Encoding.UTF8 is not yielding the desired result.
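For completeness, the long-winded workaround I have in mind would be roughly this (a rough sketch; the paths and method name are placeholders, and it assumes the file bytes are otherwise valid UTF-8 and only missing the BOM):

// Rough sketch: prepend the UTF-8 BOM to a copy of the file, then read the copy.
// Requires: using System.IO; using System.Text;
static void CopyWithUtf8Bom(string sourcePath, string destPath)
{
    byte[] bom = Encoding.UTF8.GetPreamble();
    byte[] data = File.ReadAllBytes(sourcePath);

    using (var outStream = File.Create(destPath))
    {
        outStream.Write(bom, 0, bom.Length);
        outStream.Write(data, 0, data.Length);
    }
}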

Edit:

Still no luck despite trying the suggestion in the answer.

I cut the test data down to just Arabic, as follows:

[Screenshot: the cut-down test file containing only Arabic text]

Code as follows:

FileStream fs = new FileStream(fileName.Text, FileMode.Open);
StreamReader sr = new StreamReader(fs, Encoding.UTF8, false);
tbx1.Text = sr.ReadToEnd();
sr.Close();
fs.Close();

Tried with both "true" and "false" as the last argument to the StreamReader constructor (detectEncodingFromByteOrderMarks), but both give the same result.

If I open the file in Notepad++ and select the Arabic ISO-8859-6 character set, it displays fine.

Here is what it looks like in Notepad++ (and what I would like the textbox to display):

[Screenshot: the Arabic text rendered correctly in Notepad++]
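Given that Notepad++ only shows the text correctly with ISO-8859-6 selected, maybe the bytes are not UTF-8 at all. If so, a read with that encoding specified explicitly would look something like this (just a sketch I have yet to verify against the real file):

// Sketch: read the file as Arabic ISO-8859-6 instead of UTF-8.
tbx1.Text = File.ReadAllText(fileName.Text, Encoding.GetEncoding("iso-8859-6"));

// Or, if it turns out to be the Windows Arabic code page instead:
// tbx1.Text = File.ReadAllText(fileName.Text, Encoding.GetEncoding(1256));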

Not sure if the issue is in the reading from the file or the writing to the TextBox.

I will try inspecting the data after the read to see. But at the moment, I'm puzzled.
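The inspection I have in mind is something along these lines (just dumping the first bytes in hex, to see whether the Arabic shows up as UTF-8 multi-byte pairs, typically starting with D8 or D9, or as single bytes from a legacy code page):

// Sketch: dump the first 32 bytes of the file in hex for inspection.
// Requires: using System.Diagnostics; using System.IO; using System.Linq;
byte[] raw = File.ReadAllBytes(fileName.Text);
Debug.WriteLine(string.Join(" ", raw.Take(32).Select(b => b.ToString("X2"))));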

TenG
  • http://stackoverflow.com/questions/2223882/whats-different-between-utf-8-and-utf-8-without-bom – Stefano D Aug 03 '16 at 16:34
  • You probably showed the file after Notepad wrote it, that doesn't help. A BOM is controversial, Unix OSes have adopted utf-8 but most utilities cannot properly handle a BOM. When you pass Encoding.UTF8 then you still leave it up to the File class to detect the BOM and override your choice if it has one. Update your hex dump with the actual file content. – Hans Passant Aug 03 '16 at 16:46
  • Are you certain the bytes that are supposed to be the Arabic characters are actually the correct UTF8 representation of said characters? I've seen very frequently characters that are passed off as UTF8 but are actually bytes from a different character set (such as ISO-8859-6 or Windows-1256). That leads to display issues such as this. – Dean Goodman Aug 03 '16 at 23:53
  • Thanks Dean. I'll take a look and see if there is anything in the file that is not UTF8. – TenG Aug 04 '16 at 09:57

1 Answer


The StreamReader class has a constructor that will take care of testing for the BOM for you:

using (var stream = new FileStream(fileName.Text, FileMode.Open, FileAccess.Read))
{
    // The final "true" tells the reader to look for a BOM before falling back to UTF-8.
    using (var sr = new StreamReader(stream, Encoding.UTF8, true))
    {
        var text = sr.ReadToEnd();
    }
}

The final true parameter is detectEncodingFromByteOrderMarks:

The detectEncodingFromByteOrderMarks parameter detects the encoding by looking at the first three bytes of the stream. It automatically recognizes:

  • UTF-8
  • little-endian Unicode
  • and big-endian Unicode text

if the file starts with the appropriate byte order marks. Otherwise, the user-provided encoding is used. See the Encoding.GetPreamble method for more information.
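If the text still comes out garbled after this, one quick sanity check (just a sketch, not tested against your file) is to see which encoding the reader actually settled on after reading:

// Sketch: check what the StreamReader detected (Debug is in System.Diagnostics).
using (var stream = new FileStream(fileName.Text, FileMode.Open, FileAccess.Read))
using (var sr = new StreamReader(stream, Encoding.UTF8, true))
{
    var text = sr.ReadToEnd();

    // CurrentEncoding is only meaningful after the first read. If it still
    // reports UTF-8 and the text is garbled, the file bytes are most likely
    // not UTF-8 at all (e.g. a legacy Arabic code page such as ISO-8859-6).
    Debug.WriteLine(sr.CurrentEncoding.EncodingName);
}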

Dean Goodman
  • Thank you Dean. Your answer makes sense, but I still am not able to get the desired result. Please see my "Edit" to the question to see the results after trying your suggestion. – TenG Aug 03 '16 at 23:22