I want my C# application (which has a GUI) to help the user choose between "unicode (utf-8)" and "legacy (cp1252)". I would like to give the user two independent true/false readings indicating whether the file can be 'successfully' (though not necessarily correctly) read in each of those two encodings with no loss of detail.
When I tried the following in C#, it didn't work: it seems to always return true, even when I call it on a utf-8 text file that I know contains non-Roman characters.
[EDIT: Actually, I shouldn't have expected this to fail. It is probably one of those reasonable successes that happens to be incorrect, since most (all?) byte streams are also valid cp1252. Testing the other direction, i.e. reading the file as utf-8, does detect invalid utf-8, just as the Python code below does.]
E.g. CanBeReadAs("nepali.txt", Encoding.GetEncoding(1252)) ought to return false, but it returns true.
public static bool CanBeReadAs(string filePath, Encoding encoding)
{
    // make it strict:
    encoding = Encoding.GetEncoding(encoding.CodePage, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
    using (var r = new StreamReader(filePath, encoding, false))
    {
        try
        {
            r.ReadToEnd();
        }
        catch (Exception e)
        {
            //swallow
            return false;
        }
    }
    return true;
}
I've also tried it with "string s = r.ReadToEnd();" just to make sure that it really is being forced to decode the data, but that doesn't seem to affect anything.
What am I doing wrong?
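The asymmetry described in the EDIT above can be reproduced on in-memory bytes. This is a minimal Python sketch, not a C# fix: `can_decode` is a hypothetical helper and the sample byte strings are made up for illustration. One caveat: Python's cp1252 codec leaves five byte values undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) and rejects them, whereas other implementations may map every byte to some character, which would be consistent with a cp1252 check that always returns true.

```python
def can_decode(data: bytes, enc: str) -> bool:
    """Return True if data decodes under enc with strict error handling."""
    try:
        data.decode(enc, errors="strict")
    except UnicodeDecodeError:
        return False
    return True


utf8_bytes = "é".encode("utf-8")     # b'\xc3\xa9'
cp1252_bytes = "é".encode("cp1252")  # b'\xe9'

print(can_decode(utf8_bytes, "utf-8"))    # True
print(can_decode(utf8_bytes, "cp1252"))   # True, but the decoded text is wrong ('Ã©')
print(can_decode(cp1252_bytes, "utf-8"))  # False: a lone 0xE9 is not valid utf-8
print(can_decode(b"\x8d", "cp1252"))      # False in Python: 0x8D is undefined in its cp1252 table
```

So a "true" for cp1252 says very little by itself, while a "false" for utf-8 is a strong signal.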
Note: If I need to be doing anything special to deal with BOMs, please let me know that too. I'm inclined to ignore them if that's simple. (Some of these files have mixed encodings, BTW, though I would like to think that anything actually beginning with a BOM is pure unicode.)
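For the BOM question, here is a small Python sketch of one way to ignore a utf-8 BOM. The helper names `has_utf8_bom` and `decode_utf8_ignoring_bom` are hypothetical; `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the standard library. It only covers the utf-8 BOM, not utf-16 or utf-32.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if data starts with the utf-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def decode_utf8_ignoring_bom(data: bytes) -> str:
    """Strictly decode utf-8, dropping a leading BOM if one is present."""
    return data.decode("utf-8-sig")


with_bom = codecs.BOM_UTF8 + "abc".encode("utf-8")

print(has_utf8_bom(with_bom))              # True
print(decode_utf8_ignoring_bom(with_bom))  # abc
print(decode_utf8_ignoring_bom(b"abc"))    # abc
print(repr(with_bom.decode("utf-8")))      # '\ufeffabc' (plain utf-8 keeps the BOM)
print(with_bom.decode("cp1252"))           # ï»¿abc (the BOM bytes read as cp1252)
```

Note that the BOM bytes are themselves valid cp1252, so a BOM does not make the cp1252 reading fail.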
Here is a Python script I'd created, which uses the same strategy and works fine:
def bad_encoding(filename, enc='utf-8', max=9):
    '''Return a list of up to max error strings for lines in the file not encoded in the specified encoding.
    Otherwise, return an empty list.'''
    errors = []
    with open(filename, encoding=enc) as f:
        i = 0
        # Stop once max errors have been collected.
        while len(errors) < max:
            i += 1
            try:
                line = f.readline()
            except UnicodeDecodeError:
                errors.append('UnicodeDecodeError: Could not read line {} as {}.'.format(i, enc))
                continue  # keep going; line was not assigned on this pass
            if not line:
                break  # an empty string means end of file
    return errors
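A variant of the same strategy reads the file as bytes and decodes one line at a time, which keeps the line numbers independent of how the text reader recovers after an error. This is only a sketch: `bad_lines` and `max_errors` are hypothetical names, and it assumes line breaks are ASCII-compatible (true for utf-8 and cp1252, not for utf-16).

```python
def bad_lines(data: bytes, enc: str = "utf-8", max_errors: int = 9) -> list:
    """Return up to max_errors 1-based line numbers that fail strict decoding."""
    bad = []
    for number, raw in enumerate(data.splitlines(), start=1):
        if len(bad) >= max_errors:
            break
        try:
            raw.decode(enc)
        except UnicodeDecodeError:
            bad.append(number)
    return bad


sample = b"ok\n\xe9\nfine\n\xff\n"

print(bad_lines(sample))            # [2, 4]
print(bad_lines(sample, "cp1252"))  # []
```

For a real file, the bytes could come from `open(filename, 'rb').read()`.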