I have very large text files that are being imported via SSIS into our database.These files come from hundreds of companies and a variety of different source systems. Most of these files are fine importing with code page 1252, but in some files, buried somewhere in one of the rows, there might be some oddball characters that don't fit in the 1252 code page.
I've implemented a solution based on this SO answer, which allows me to proceed with code page 1252 on one path if the file's encoding is ANSI/ASCII, OR it can go down another path with a 65001 code page. This seems to work in a lot of cases, but is not reliable enough to be something we could use in production.
using(var r = new StreamReader(filename, Encoding.Default))
{
richtextBox1.Text = r.ReadToEnd();
var encoding = r.CurrentEncoding;
}
I'm far from an expert on file encoding, but I'm guessing that it's because it only reads a certain portion of the file and if everything looks like ANSI characters, it will assume it is ANSI (these files are almost guaranteed not to have a BOM)?
Would I have to read the entire file into memory and examine every character to come to a mostly accurate file encoding? How can I do this when reading an extremely large file into memory would cause huge problems?
Is there a way to accomplish this with a reasonable level of certainty? I don't need to account for any kind of foreign languages as these are all English, but we've encountered the occasional strange character included in these files. I'm thinking we need to allow for ASCII, UTF-8 and UTF-16.
Is there a way to just be able to determine whether to use code page 1252 or 65001 in SSIS?
On a related note, if ASCII is a subset of UTF-8, why is it that when I import ALL the files as code page 65001, some of the characters don't translate correctly? Shouldn't UTF-8 work for everything if it encompasses ASCII?