Read special characters in C# from a file

Question

I have a text file which contains one of the following strings.

 Match List – I and List – II and identify the correct code :

I have the following code in C# to read this line, and when I Console.Write it I get

  Match List ? I and List ? II and identify the correct code :

On closer inspection I realized that:

 – and - are different!

Now, how can I read the file and get its contents exactly? I need the data to be stored in a database.

My code:

       string filefullnamepath = @"E:\PROJECTS\NETSLET\Console\Console\files\sample.txt";
        string filecontents = "";
        using (StreamReader sr = System.IO.File.OpenText(filefullnamepath))
        {
            filecontents = sr.ReadToEnd();
        }

Ok I added the following line:

 Console.OutputEncoding = System.Text.Encoding.UTF8;

now i get

I need to store the contents in the database. Even in the database it is stored as ?. I am using MS SQL Server 2018r2.

You are reading it exactly, it's the console that can't display the characters. — Ron Beyer, Mar 19 '18 at 14:43
What encoding is the file using? This looks like an encoding mismatch, i.e. it's encoded in ANSI and you're reading it as UTF-8 — Liam, Mar 19 '18 at 14:43
The console can't display non-ANSI characters unless you use a Unicode font. Don't use it to check text — Panagiotis Kanavos, Mar 19 '18 at 14:45
@Liam OpenText uses UTF8. The console though *can't* display Unicode text without a Unicode font *and* setting the OutputEncoding to UTF8 [as shown here](https://stackoverflow.com/questions/5750203/how-to-write-unicode-characters-to-the-console) — Panagiotis Kanavos, Mar 19 '18 at 14:48
@itsme86 there's no standard term to refer to "single byte codepage that's not what you thought it was". ASCII is often used to mean the 7-bit latin codepage but back when people actually used codepages, ASCII were all of them — Panagiotis Kanavos, Mar 19 '18 at 14:53
@itsme86: if you want to get technical (and why not, we're on SO) when people say "ANSI" they mean "the default code page of the system", which Windows has unfortunately taken to name "ANSI" in a distant past, even though it almost never *is* (standardized by) ANSI. There is, in fact, no Windows code page that corresponds exactly to ASCII -- there are plenty that have ASCII as a subset, though. So the console can display *some* "funny characters", but typically not all of them. — Jeroen Mostert, Mar 19 '18 at 14:53
I have *never* heard someone say ANSI when referring to the default code page of the system. And I've been working with computers a *long* time. — itsme86, Mar 19 '18 at 15:01
@itsme86: you may not have heard someone *say* it, but it happens enough there's [articles written on the confusion](https://blogs.msdn.microsoft.com/oldnewthing/20051027-37/?p=33593/). Most generally, "ANSI" can only be reliably taken to mean "not Unicode"; you have to ask the speaker what they're really going for. Sometimes the docs [don't even bother to explain what they mean by it](https://learn.microsoft.com/dotnet/framework/interop/specifying-a-character-set). In the context of console output, though, it's the default code page. This is very much a Windowsism. — Jeroen Mostert, Mar 19 '18 at 15:15

Dmitry Bychenko · Accepted Answer · 2018-03-19T15:25:31.053

First of all, inspect what you have actually read (do you have the correct encoding?):

  // Needs: using System; using System.IO; using System.Linq;
  string path = @"E:\PROJECTS\NETSLET\Console\Console\files\sample.txt";

  // Easier way to read than Streams
  string fileContent = File.ReadAllText(path);

  string dump = string.Concat(fileContent
    .Select(c => c < 32 || c > 127
       ? $"\\u{(int)c:x4}"  // escape control chars and non-ASCII ones
       : c.ToString()));    // keep printable ASCII intact

  Console.Write(dump);

If you get the following (please notice the \u2013 characters)

  Match List \u2013 I and List \u2013 II and identify the correct code :

then the reading is correct and it's the output that is wrong. You should change the font you are using. If the dump doesn't look like the above, but like this instead (please notice the ?):

  Match List ? I and List ? II and identify the correct code :

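It means that the decoder couldn't map the bytes to characters and substituted each of them with ? (after a UTF-8 decode the substitute typically shows up as \ufffd in the dump instead); so the problem is in the reading, that is, in the encoding. Try specifying the encoding explicitly: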

  // UTF-8
  string fileContent = File.ReadAllText(path, Encoding.UTF8);
  ...
  // Windows-1250
  string fileContent = File.ReadAllText(path, Encoding.GetEncoding(1250));
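
Before trying code pages one by one, it may help to check whether the file is valid UTF-8 at all. The following is only a minimal sketch, not part of the original answer: the class name Utf8Probe and the method IsValidUtf8 are hypothetical, and it relies on the UTF8Encoding constructor that enables throwOnInvalidBytes.

```csharp
using System;
using System.Text;

static class Utf8Probe
{
    // Hypothetical helper: returns true when the bytes decode as UTF-8
    // without hitting any invalid byte sequence.
    public static bool IsValidUtf8(byte[] bytes)
    {
        // throwOnInvalidBytes: true makes the decoder throw instead of
        // silently substituting U+FFFD for bad sequences.
        var strict = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        try
        {
            strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

It could be called as Utf8Probe.IsValidUtf8(File.ReadAllBytes(path)). A false result suggests the file was saved in a single-byte code page, in which case passing an explicit encoding as shown above is the next thing to try.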

Edit: In the worst case, when you can't simply re-save the file with the required encoding and have to guess the original one, you can try automating the process:

  string path = "";

  var tries = Encoding.GetEncodings()
    .Select(encoding => new {
       encoding = encoding,
       text = File.ReadAllText(path, encoding.GetEncoding()),  
     } )  
    .Select(item => $"{item.encoding.Name,-8} => {item.text} <- {(item.text.Any(c => c == 0x2013) ? "got it!" : "wrong")}");

  Console.WriteLine(string.Join(Environment.NewLine, tries));

Possible output:

  IBM037  => Match List ? I and List ? II and identify the correct code :  <- wrong
  IBM437  => Match List ? I and List ? II and identify the correct code :  <- wrong 
  ...
  windows-1250 => Match List – I and List – II and identify the correct code :  <- got it! 
  ...
  utf-8   => Match List ? I and List ? II and identify the correct code :  <- wrong
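
To see why only some encodings report "got it!", it can help to look at the bytes the en dash (U+2013) takes in each one. This is a rough sketch under assumptions: the class DashBytes and its methods are hypothetical names, and the call to Encoding.RegisterProvider assumes .NET Core 3.0 or later (or the System.Text.Encoding.CodePages package), where Windows code pages are not available by default.

```csharp
using System;
using System.Text;

static class DashBytes
{
    // Hex dump of the bytes a given encoding produces for some text.
    public static string Hex(Encoding encoding, string text) =>
        BitConverter.ToString(encoding.GetBytes(text));

    // Looks up a Windows code page. Assumption: on .NET Core / .NET 5+
    // these only resolve after the code pages provider is registered.
    public static Encoding CodePage(int codePage)
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(codePage);
    }
}
```

For example, DashBytes.Hex(Encoding.UTF8, "\u2013") gives three bytes, while DashBytes.Hex(DashBytes.CodePage(1250), "\u2013") gives a single byte. A file holding that single byte is not valid UTF-8, which is consistent with the utf-8 row above being marked wrong.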

I used Notepad and it saved the file in ASCII format by default. I reopened the file and then saved it in UTF-8 format. Now it works. Thanks for your help — Venkat, Mar 19 '18 at 15:13
@Venkat: it's nice that the current problem turned out to have an easy solution; I've edited the answer, however, in case you face a worse problem where you have to *guess* the encoding used and are not allowed to change the original files — Dmitry Bychenko, Mar 19 '18 at 15:29
