I want my C# application (which has a GUI) to help the user choose between "unicode (utf-8)" and "legacy (cp1252)". I would like to give the user two independent true/false readings on whether the file can be 'successfully' (though not necessarily correctly) read in each of those two encodings with no loss of detail.

When I tried the following in C#, it didn't work. That is, it seems to always return true, even if I call it on a utf-8 text file that I know contains non-Roman characters.

[EDIT: Actually, I shouldn't have thought this should fail. Could be one of those reasonable successes that happens to be incorrect, since most (all?) byte streams are also valid cp1252. Testing the other direction does find invalid utf-8 as the Python code below does.]

E.g. CanBeReadAs("nepali.txt", Encoding.GetEncoding(1252)) ought to return false, but it returns true.

public static bool CanBeReadAs(string filePath, Encoding encoding)
{
    // Make it strict: throw on undecodable bytes instead of substituting a replacement character.
    encoding = Encoding.GetEncoding(encoding.CodePage, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
    using (var r = new StreamReader(filePath, encoding, detectEncodingFromByteOrderMarks: false))
    {
        try
        {
            r.ReadToEnd();
        }
        catch (DecoderFallbackException)
        {
            // Swallow: the file is not valid in this encoding.
            return false;
        }
    }
    return true;
}

I've also tried it with "string s = r.ReadToEnd();" just to make sure that it really is being forced to decode the data, but that doesn't seem to affect anything.

What am I doing wrong?

Note: If I need to be doing anything special to deal with BOMs, please let me know that too. I'm inclined to ignore them if that's simple. (Some of these files have mixed encodings, BTW, though I would like to think that anything actually beginning with a BOM is pure unicode.)

Here is a Python script I'd created, which uses the same strategy and works fine:

def bad_encoding(filename, enc='utf-8', max=9):
    '''Return a list of up to max error strings for lines in the file not encoded in the specified encoding.

    Otherwise, return an empty list.'''
    errors = []
    line = None
    with open(filename, encoding=enc) as f:
        i = 0
        while True:
            try:
                i += 1
                line = f.readline()
            except UnicodeDecodeError:
                errors.append('UnicodeDecodeError: Could not read line {} as {}.'.format(i, enc))
            # Stop at end of file, or once max errors have been collected.
            if not line or len(errors) >= max:
                break
    return errors
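The asymmetry mentioned in the edit above (utf-8 bytes nearly always "succeed" as cp1252, while legacy bytes tend to fail as utf-8) can be reproduced in memory without any files. This is only a minimal sketch: `can_decode` and the two sample byte strings are made up for illustration and are not part of the application. Note that Python's cp1252 codec leaves five byte values (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined, so it can reject some input; the sample below avoids those bytes.

```python
def can_decode(data: bytes, encoding: str) -> bool:
    """Return True if data decodes strictly (no replacement characters) as encoding."""
    try:
        data.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# UTF-8 bytes for a Devanagari word (hypothetical sample text).
utf8_bytes = "\u0928\u0947\u092a\u093e\u0932\u0940".encode("utf-8")
# cp1252 bytes for "café"; the lone 0xE9 byte is not a valid UTF-8 sequence.
legacy_bytes = "caf\u00e9".encode("cp1252")

# Decodes without error, but the result is mojibake rather than Devanagari.
utf8_as_cp1252 = can_decode(utf8_bytes, "cp1252")
# Strict UTF-8 decoding rejects the legacy bytes.
legacy_as_utf8 = can_decode(legacy_bytes, "utf-8")
```

So a "true" for cp1252 says very little on its own, while a "false" for utf-8 is a strong signal that the file is legacy-encoded.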
Jon Coombs
  • Have you tried setting the decoding fallback like this: `Encoding.GetEncoding(1252, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);` – Mike Zboray Jul 23 '14 at 02:30
  • Thanks! That looks great, and I've updated the code to use it. (Also found the MSDN docs now that I know what to search for.) However, it's still returning true in both cases. I'll try testing some more with more data files to make sure I'm not making a mistake on that end. – Jon Coombs Jul 23 '14 at 03:56
  • Hmmm... then it would help if you provided code that generated a file you think should throw an exception. – Mike Zboray Jul 23 '14 at 06:51
  • I'm realizing that part of my problem is that cp1252 will almost never "fail", whether or not it displays properly. I am now getting failures reading some legacy data as utf-8, so now I just need to try to break cp1252. Perhaps these characters will; not sure yet... http://www.i18nqa.com/debug/bug-double-conversion.html And in fact, if it's fail-proof, that's actually good news for my project: http://stackoverflow.com/a/2014087/1593924 – Jon Coombs Jul 23 '14 at 07:31
  • @mikez, if you want to paste your comment in as an answer, I'll accept it. Since that code correctly throws when reading legacy as utf-8, I assume that if any similar failures are in fact possible in the reverse case, it would correctly throw there as well. – Jon Coombs Aug 02 '14 at 16:18

1 Answer

The static Encoding instances available through the Encoding class (ASCII, UTF8, Unicode, etc.) all make a best effort to decode the input bytes and do not throw if they fail.

To create an Encoding with a specific encode/decode behavior, use the overload of Encoding.GetEncoding that takes EncoderFallback/DecoderFallback parameters. I tried creating instances of various encodings (ASCIIEncoding, UTF8Encoding), but they are read-only, so setting the fallback options always threw an InvalidOperationException. In your case, to create an instance that throws when decoding fails, try:

encoding = Encoding.GetEncoding(encoding.CodePage, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
Mike Zboray