I have a textfile with the content:

A B C D Ä 1 4 0 $ % & € / [ ) = ß ² µ §

I don't know which encoding the file uses. When I open it in Notepad++, the Encoding menu shows "Encoding in ANSI".

I would like to read this file and have every character recognized correctly. This is my code:

// open and lock the file
using (FileStream fs = File.Open(@"C:\testfile.txt", FileMode.Open, FileAccess.Read, FileShare.None))
{
    using (TextReader reader = new StreamReader(fs))
    {
        string line;
        //reading and printing each line
        while ((line = reader.ReadLine()) != null)
        {
            System.Console.WriteLine(line);
        }
    }
}

As output I get: [image: console output]

So for Ä € ß ² µ § I get a ?. I thought the console was the cause, so I switched its output encoding to UTF-8 hoping for better output, but that doesn't really help.

System.Console.OutputEncoding = System.Text.Encoding.UTF8;

That's why I think something goes wrong while reading the file. I should probably change the encoding of the StreamReader, but there are not that many options. I tried UTF8 and ASCII, and neither helps. Any ideas?

Edit: Thanks Matthew, adding System.Text.Encoding.Default to the StreamReader helps. Now only the € char is not recognizable. I don't get it, are some chars "special"?

Edit2: Alright, the € was only a problem because the console is buggy(?). If I look at the string in debug mode, the € is fine too.

So the working solution for me is now:

1.) Using the reader with default encoding:

using (TextReader reader = new StreamReader(fs, System.Text.Encoding.Default))

AND

2.) Not using the console for output, just reading the string in debug mode
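If attaching a debugger is inconvenient, the same check can be done by printing the code points of the string instead of the characters themselves, since the code points are plain ASCII and do not depend on the console's code page or font. A minimal sketch (the `CodePointDump` class and `Describe` method are made-up names for illustration):

```csharp
using System;
using System.Linq;

public static class CodePointDump
{
    // Formats every UTF-16 code unit of a string as U+XXXX, separated by spaces.
    public static string Describe(string text)
    {
        return string.Join(" ", text.Select(c => "U+" + ((int)c).ToString("X4")));
    }

    public static void Main()
    {
        // Escapes are used so the example does not depend on how this source file is saved.
        string line = "\u00C4 \u20AC \u00A7"; // Ä € §
        Console.WriteLine(Describe(line));
        // prints: U+00C4 U+0020 U+20AC U+0020 U+00A7
    }
}
```

If a line read from the file shows U+FFFD (the Unicode replacement character) here, the reader decoded the bytes with the wrong encoding; if the code points are right but the console shows ? or boxes, the problem is on the console side.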

Marc
sabisabi
  • You should use the debugger and see if the characters are read correctly. Read them to a string first. If they are read correctly and this is only a console problem, maybe this can help: [c# unicode string output](http://stackoverflow.com/q/5055659/7586). – Kobi Feb 05 '13 at 13:37
  • @Kobi: sry for the stupid question, but what do you mean with "read them to a string"? – sabisabi Feb 05 '13 at 13:44
  • Well, actually, you already do that: `line` is a string. So just place a breakpoint on the `System.Console.WriteLine(line);` line, and check if it was read correctly. – Kobi Feb 05 '13 at 13:53
  • @Kobi : ah alright got it :D and as I see, there are some problems with the console output, because in the debug mode I see more correct characters. If I add System.Text.Encoding.Default to the StreamReader, the € char is still a problem BUT only in console, in debug mode it's fine. So yes, one part of the problem was the console. – sabisabi Feb 05 '13 at 14:42
  • No character is special... if you have an encoding screwup then many characters (notably ASCII) will still keep working simply because the encodings involved share the same scheme for encoding those characters. The correct action is then not to think of those characters as somehow special but take it as an indication that there is an encoding mismatch somewhere happening. Note that a console has its own encoding as well, typically by default set to some MS-DOS code page in Windows. – Esailija Feb 05 '13 at 15:15
  • Also, the boxes in console are usually an indication that the console understood the characters but just cannot render them because the font used in the console doesn't have the glyphs (graphics) for those characters. – Esailija Feb 05 '13 at 15:22
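
The point about shared encoding schemes can be seen with a few bytes. The sketch below is only an illustration: it assumes the file stores Ä as the single byte 0xC4 (true for Latin-1 and Windows-1252), and `EncodingMismatchDemo` is a made-up name. ISO-8859-1 is used because it is built into every .NET version.

```csharp
using System;
using System.Text;

public static class EncodingMismatchDemo
{
    // Returns true if decoding produced the Unicode replacement character U+FFFD.
    public static bool HasReplacementChar(string text)
    {
        return text.IndexOf('\uFFFD') >= 0;
    }

    public static void Main()
    {
        // "A Ä" in a single-byte Western European encoding: 'A' = 0x41, ' ' = 0x20, 'Ä' = 0xC4
        byte[] bytes = { 0x41, 0x20, 0xC4 };

        string asLatin1 = Encoding.GetEncoding("iso-8859-1").GetString(bytes);
        string asUtf8 = Encoding.UTF8.GetString(bytes);

        // The ASCII part survives in both decodings because both encodings
        // store ASCII characters as the same single bytes.
        Console.WriteLine(asLatin1[0] == 'A' && asUtf8[0] == 'A');   // True

        // 0xC4 is not a complete UTF-8 sequence, so the UTF-8 decoder substitutes U+FFFD.
        Console.WriteLine(HasReplacementChar(asLatin1));             // False
        Console.WriteLine(HasReplacementChar(asUtf8));               // True
    }
}
```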

2 Answers

If you are using ANSI, you can do it like this:

using (TextReader reader = new StreamReader(fs, System.Text.Encoding.Default))

However, that will only work if your current code page is correct for the file that you are reading. It probably will be, but for full portability you should determine the actual code page that you're using and use:

using (TextReader reader = new StreamReader(fs, System.Text.Encoding.GetEncoding(codePageNumber)))

where codePageNumber is the code page of the text file.
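
As a fuller sketch of reading with an explicit code page via `Encoding.GetEncoding`: the example below assumes the file is Windows-1252 (a guess based on the € and § characters, not something the question confirms) and that the code runs on .NET Core 3.0 or later, where legacy code pages must be registered first. On the classic .NET Framework the `RegisterProvider` line is not needed. `AnsiFileReader` is a made-up name.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class AnsiFileReader
{
    // Reads all lines of a file, decoding its bytes with the given code page.
    public static string[] ReadLines(string path, int codePage)
    {
        // On .NET Core / .NET 5+ legacy code pages such as 1252 are only
        // available after registering this provider. Repeated calls are harmless.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding encoding = Encoding.GetEncoding(codePage);

        var lines = new List<string>();
        // Open and lock the file, as in the question.
        using (FileStream fs = File.Open(path, FileMode.Open, FileAccess.Read, FileShare.None))
        using (TextReader reader = new StreamReader(fs, encoding))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                lines.Add(line);
            }
        }
        return lines.ToArray();
    }

    public static void Main(string[] args)
    {
        // Path taken from the question; code page 1252 is an assumption.
        string path = args.Length > 0 ? args[0] : @"C:\testfile.txt";
        foreach (string line in ReadLines(path, 1252))
        {
            Console.WriteLine(line);
        }
    }
}
```

Passing the code page explicitly means the result no longer depends on the locale of the machine the program runs on.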

Matthew Watson
  • hej "System.Text.Encoding.Default" did help! I thought if I dont select the encoding, it's always default and I dont have to write like that. Somehow § is still not recognizable. The others are - now I'm confused. – sabisabi Feb 05 '13 at 13:52
  • If § isn't recognised, it can't be supported in your local codepage... What is your locale? (The default encoding is UTF8, by the way. Which makes it very confusing that they made Encoding.Default be ANSI....) – Matthew Watson Feb 05 '13 at 13:57

You can use the Mozilla Universal Charset Detector, a .NET port of which is available here to determine the encoding for a file pretty reliably. That will let you then open most files with the correct encoding with very little effort on your part.
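
If pulling in a detector library is not an option, a much cruder heuristic sometimes suffices: try to decode the bytes as strict UTF-8 and, if that fails, fall back to a single-byte encoding chosen by the caller. This is not the Mozilla detector and cannot tell one ANSI code page from another; it is only a sketch, and `EncodingGuess` and `DecodeUtf8OrFallback` are made-up names.

```csharp
using System;
using System.Text;

public static class EncodingGuess
{
    // Decodes bytes as UTF-8 if they form valid UTF-8, otherwise with the fallback encoding.
    public static string DecodeUtf8OrFallback(byte[] bytes, Encoding fallback)
    {
        // throwOnInvalidBytes: true makes the decoder throw instead of
        // silently inserting U+FFFD for invalid sequences.
        var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

        // Skip a UTF-8 byte order mark (EF BB BF) if present; GetString does not strip it.
        int offset = bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF ? 3 : 0;

        try
        {
            return strictUtf8.GetString(bytes, offset, bytes.Length - offset);
        }
        catch (DecoderFallbackException)
        {
            return fallback.GetString(bytes);
        }
    }

    public static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        byte[] utf8Bytes = { 0x41, 0xC3, 0x84 }; // "AÄ" encoded as UTF-8
        byte[] ansiBytes = { 0x41, 0xC4 };       // "AÄ" encoded as Latin-1 / Windows-1252

        // Both inputs decode to the same two-character string.
        Console.WriteLine(DecodeUtf8OrFallback(utf8Bytes, latin1) == DecodeUtf8OrFallback(ansiBytes, latin1));
        // prints: True
    }
}
```

The heuristic works because text with non-ASCII characters in a single-byte encoding is rarely valid UTF-8 by accident; pure ASCII files decode identically either way.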

Matt Whitfield