Possible Duplicate:
How can I detect the encoding/codepage of a text file

I have an ASP.NET MVC application. In my view I upload a text file and process it in a controller method with this signature:

[HttpPost]
public ActionResult FromCSV(HttpPostedFileBase file, string platform)

I get a stream from the uploaded file as file.InputStream and read it using a standard StreamReader:

using (var sr = new StreamReader(file.InputStream))
{
    ...
}

The problem is that this only works for UTF text files. When the file is encoded in Windows-1250, the non-ASCII characters get garbled. I can read Windows-1250 encoded text files correctly when I specify the encoding explicitly:

using (var sr = new StreamReader(file.InputStream, Encoding.GetEncoding(1250)))
{
    ...
}

I need to support both UTF and Windows-1250 encoded files, so I need a way to detect the encoding of the submitted file.

AGuyCalledGerald
Igor Kulman
  • Is there any way to know any part of the content of this file? I.e. if you knew that a particular string was likely to be there, you could read it and see if it can be found, and if not, try a different encoding. – Andras Zoltan Jan 09 '13 at 12:47
  • @AndrasZoltan I only know that the files are CSV files, either created in Excel (Windows-1250) or exported from Google Docs (UTF). I do not know the content of those files. – Igor Kulman Jan 09 '13 at 12:48
  • @mathieu in this specific case (UTF-8 or 1250) that answer doesn't apply – Esailija Jan 09 '13 at 13:01
  • If you can use a BOM use it else see http://stackoverflow.com/q/90838/266919 – AxelEckenberger Jan 09 '13 at 13:19

1 Answer

With an exception fallback, trying to decode a file encoded in Windows-1250 as UTF-8 is extremely likely to throw an exception. If it does not, the file almost certainly uses only the ASCII subset, so it does not matter which encoding is used to decode it. You could therefore do something like this:

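using System.Text;

// fileAsByteArray is assumed to hold the complete contents of the uploaded file.
// Both encodings use exception fallbacks, so invalid input throws
// instead of being silently replaced.
Encoding[] encodings = new Encoding[] {
    Encoding.GetEncoding("UTF-8", new EncoderExceptionFallback(), new DecoderExceptionFallback()),
    Encoding.GetEncoding(1250, new EncoderExceptionFallback(), new DecoderExceptionFallback())
};

string result = null;

foreach (Encoding enc in encodings) {
    try {
        result = enc.GetString(fileAsByteArray);
        break; // decoded cleanly, so stop at the first encoding that works
    }
    catch (DecoderFallbackException) {
        // The bytes are not valid in this encoding; try the next one.
    }
}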
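
The snippet above assumes the file is already available as a byte array. A minimal sketch of how that could be wired up for an uploaded stream follows. The class and method names (UploadDecoder, DecodeUtf8Or1250) are hypothetical, and it assumes that only UTF-8 and Windows-1250 need to be told apart and that the file is small enough to buffer in memory:

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper for illustration, not a definitive implementation.
public static class UploadDecoder
{
    public static string DecodeUtf8Or1250(Stream input)
    {
        // Buffer the whole upload once so every decoding attempt sees all
        // the bytes, regardless of what happens to the stream afterwards.
        byte[] bytes;
        using (var buffer = new MemoryStream())
        {
            input.CopyTo(buffer);
            bytes = buffer.ToArray();
        }

        // Strict UTF-8: throws DecoderFallbackException on invalid sequences.
        var strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            string text = strictUtf8.GetString(bytes);
            // GetString keeps a leading byte order mark as U+FEFF, so strip it.
            return text.Length > 0 && text[0] == '\uFEFF' ? text.Substring(1) : text;
        }
        catch (DecoderFallbackException)
        {
            // Not valid UTF-8, so assume Windows-1250.
            // On .NET Core and later, code page 1250 is only available after
            // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
            return Encoding.GetEncoding(1250).GetString(bytes);
        }
    }
}
```

In the controller this could be called as `string csv = UploadDecoder.DecodeUtf8Or1250(file.InputStream);` before anything else reads the stream, and the result can then be parsed through a `StringReader`. Stream.CopyTo starts at the stream's current position, so a stream that has already been consumed would yield an empty string.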
Esailija
  • If I try to read a win1250 file as UTF using your code, it throws an exception, but the next iteration that tries to read the file as win1250 gets a stream with `sr.EndOfStream==true`, so there is nothing to read. I tried putting `file.InputStream.Seek(0, SeekOrigin.Begin)` after `try` but it did not help – Igor Kulman Jan 09 '13 at 13:00
  • @IgorKulman yeah I am quite shady on the details but the principle is working as you can see. Maybe you can read the file to a byte array first and use the byte array instead of stream if that's feasible. – Esailija Jan 09 '13 at 13:00
  • @IgorKulman I guess it's the `using` statement, after the first iteration the stream will be closed – Esailija Jan 09 '13 at 13:08