52

I have a file that contains non-English chars and was saved in ANSI encoding using a non-English codepage. How can I read this file in C# and see the file content correctly?

Not working

StreamReader sr=new StreamReader(@"C:\APPLICATIONS.xml",Encoding.ASCII);
var ags = sr.ReadToEnd();
sr=new StreamReader(@"C:\APPLICATIONS.xml",Encoding.UTF8);
ags = sr.ReadToEnd();
sr=new StreamReader(@"C:\APPLICATIONS.xml",Encoding.Unicode);
ags = sr.ReadToEnd();

This works, but I need to know the code page in advance, which is not possible.

sr=new StreamReader(@"C:\APPLICATIONS.xml",Encoding.GetEncoding(1252));
ags = sr.ReadToEnd();
dda
MichaelT

6 Answers

74
 var text = File.ReadAllText(file, Encoding.GetEncoding(codePage));

List of code pages: https://learn.microsoft.com/en-us/windows/win32/intl/code-page-identifiers?redirectedfrom=MSDN

spottedmahn
L.B
  • 2
    I will need to know the code page. I don't know it in advance. – MichaelT Aug 26 '12 at 13:07
  • @MichaelT there are some open source libraries to *guess* the encoding, but it is not an easy process. – L.B Aug 26 '12 at 13:08
  • 1
    I saw that old MS Notepad handles this file with no problems, so I think I am missing something. – MichaelT Aug 26 '12 at 13:11
  • 5
    @MichaelT [How can I detect the encoding/codepage of a text file](http://stackoverflow.com/questions/90838/how-can-i-detect-the-encoding-codepage-of-a-text-file) – L.B Aug 26 '12 at 13:23
  • 5
    Remember http://www.joelonsoftware.com/articles/Unicode.html - The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky – gimel Aug 27 '12 at 06:42
  • Notepad guesses the current locale's code page, which you can get from `Encoding.Default`. However, an XML file in the locale code page without an `<?xml ... encoding="..."?>` declaration saying so is an outright error. – bobince Aug 27 '12 at 20:35
  • Please note that .NET Core only supports ASCII, ISO-8859-1 and Unicode encodings. So you will get an error when trying to use encoding 1252 (ANSI Latin 1; Western European Windows). What works for me is encoding 65000 (utf-7 Unicode). – Martijn Sep 17 '20 at 08:45
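
When the code page is not known in advance, one common heuristic (not something the answer above prescribes) is to try a strict UTF-8 decode first and fall back to a caller-chosen code page only if the bytes are not valid UTF-8. The sketch below assumes that approach; `TextFileReader` and `ReadWithFallback` are hypothetical names, and the fallback encoding is still a guess that the caller has to supply.

```csharp
using System;
using System.IO;
using System.Text;

static class TextFileReader
{
    // Hypothetical helper: decode as strict UTF-8, otherwise use the fallback encoding.
    public static string ReadWithFallback(string path, Encoding fallback)
    {
        byte[] bytes = File.ReadAllBytes(path);

        // throwOnInvalidBytes: true makes invalid UTF-8 raise an exception
        // instead of silently producing replacement characters.
        var strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            // GetString keeps a leading byte order mark as U+FEFF, so trim it.
            return strictUtf8.GetString(bytes).TrimStart('\uFEFF');
        }
        catch (DecoderFallbackException)
        {
            // Not valid UTF-8: assume the single-byte code page chosen by the caller.
            return fallback.GetString(bytes);
        }
    }

    static void Main(string[] args)
    {
        if (args.Length == 0)
        {
            Console.WriteLine("usage: pass the path of the file to read");
            return;
        }
        // ISO-8859-1 is used here only as an example fallback.
        Console.WriteLine(ReadWithFallback(args[0], Encoding.GetEncoding("iso-8859-1")));
    }
}
```

This only distinguishes UTF-8 from "something else". It cannot tell one single-byte code page from another, which is why the guessing libraries mentioned in the comments exist.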
14

You get the question-mark-diamond characters when your text file uses a high-ANSI encoding, meaning it contains characters between 128 and 255. Those characters have the eighth (i.e. the most significant) bit set. When ASP.NET reads the text file it assumes UTF-8 encoding, in which that most significant bit has a special meaning.

You must force ASP.NET to interpret the textfile as high-ANSI encoding, by telling it the codepage is 1252:

String textFilePhysicalPath = System.Web.HttpContext.Current.Server.MapPath("~/textfiles/MyInputFile.txt");
String contents = File.ReadAllText(textFilePhysicalPath, System.Text.Encoding.GetEncoding(1252));
lblContents.Text = contents.Replace("\n", "<br />");  // change linebreaks to HTML
Snizzle
  • 2
    Should be the accepted answer IMHO. Furthermore, with .NET Core 2.x or .NET Standard you will get a new problem: code pages need to be registered first, see https://stackoverflow.com/questions/37870084/net-core-doesnt-know-about-windows-1252-how-to-fix – Philm Jul 27 '19 at 13:24
  • 1
    Please note that .NET Core only supports ASCII, ISO-8859-1 and Unicode encodings. So you will get an error when trying to use encoding 1252 (ANSI Latin 1; Western European Windows). What works for me is encoding 65000 (utf-7 Unicode). – Martijn Sep 17 '20 at 08:36
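
The registration step mentioned in the comments can be sketched as follows. This assumes .NET Core or .NET 5+, where `CodePagesEncodingProvider` comes from the `System.Text.Encoding.CodePages` library (a separate NuGet package on older .NET Core versions); `CodePageDemo` and `DecodeWindows1252` are hypothetical names used only for illustration.

```csharp
using System;
using System.Text;

static class CodePageDemo
{
    // Hypothetical helper: decode bytes as Windows-1252 after registering
    // the code page provider.
    public static string DecodeWindows1252(byte[] bytes)
    {
        // Without this call, Encoding.GetEncoding(1252) is not available
        // on .NET Core. Calling it more than once is harmless.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1252).GetString(bytes);
    }

    static void Main()
    {
        // Byte 0x80 is the euro sign in Windows-1252.
        string euro = DecodeWindows1252(new byte[] { 0x80 });
        Console.WriteLine(euro == "\u20AC");
    }
}
```

In a real application the `RegisterProvider` call would normally run once at startup rather than inside a helper.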
2

If I remember correctly the XmlDocument.Load(string) method always assumes UTF-8, regardless of the XML encoding. You would have to create a StreamReader with the correct encoding and use that as the parameter.

xmlDoc.Load(new StreamReader(
                     File.OpenRead("file.xml"),   // File.Open needs a FileMode; OpenRead does not
                     Encoding.GetEncoding("iso-8859-15")));

I just stumbled across KB308061 from Microsoft. There's an interesting passage:

Specify the encoding declaration in the XML declaration section of the XML document. For example, the following declaration indicates that the document is in UTF-16 Unicode encoding format:

<?xml version="1.0" encoding="UTF-16"?>

Note that this declaration only specifies the encoding format of an XML document and does not modify or control the actual encoding format of the data.

Link Source:

XmlDocument.Load() method fails to decode € (euro)
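
As a hedged illustration of the encoding declaration quoted above: if the bytes really are in the declared encoding, handing the raw stream to `XmlDocument.Load(Stream)` lets the XML parser read the declaration itself. The sketch below builds a small ISO-8859-1 document in memory; `XmlDeclaredEncodingDemo` and `LoadRootText` are hypothetical names, and the behaviour when the declaration and the bytes disagree is not covered.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

static class XmlDeclaredEncodingDemo
{
    // Hypothetical helper: load XML from raw bytes and return the root element's text.
    public static string LoadRootText(byte[] xmlBytes)
    {
        var doc = new XmlDocument();
        using (var stream = new MemoryStream(xmlBytes))
        {
            // Given a Stream (not a TextReader), the parser decodes the bytes
            // using the encoding named in the XML declaration.
            doc.Load(stream);
        }
        return doc.DocumentElement.InnerText;
    }

    static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        string xml = "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?><root>caf\u00e9</root>";

        // Encode with the same encoding the declaration names.
        byte[] bytes = latin1.GetBytes(xml);
        Console.WriteLine(LoadRootText(bytes) == "caf\u00e9");
    }
}
```

Wrapping the stream in a `StreamReader` with an explicit encoding, as in the answer, is the alternative when the declaration is missing or cannot be trusted.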

Community
KF2
  • why not [*`File.ReadAllText`*](http://msdn.microsoft.com/en-us/library/ms143369.aspx)? – Adam Aug 26 '12 at 13:04
  • @MichaelT can you give a screenshot of your result? – KF2 Aug 26 '12 at 13:06
  • @MichaelT: try my new answer – KF2 Aug 26 '12 at 13:29
  • If the `<?xml?>` prolog in your XML file says UTF-8, and it's not a proper UTF-8 stream, then what you have got is not well-formed and thereby not XML. Really you need to fix whatever is producing the bogus XML files. – bobince Aug 27 '12 at 20:32
0

In my case, with C++/CLI (WinForms), this approach worked:

String^ str2 = File::ReadAllText("MyText_cyrillic.txt",System::Text::Encoding::GetEncoding(1251)); 
textBox1->Text = str2;   
0
using (StreamReader file = new StreamReader(filePath, Encoding.GetEncoding("ISO-8859-1")))
{
    JsonSerializer serializer = new JsonSerializer();
    IList<Type> result = (IList<Type>)serializer.Deserialize(file, typeof(IList<Type>));
}
    
Encoding used for the ANSI file: ISO-8859-1
Tayyeb
-1
using (StreamWriter writer = new StreamWriter(File.Open(@"E:\Sample.txt", FileMode.Append), Encoding.GetEncoding(1250)))  // or File.Create(path)
{
    writer.Write("Sample Text");
}