Determine File Encoding for Large Files in SSIS / C#

Question

I have very large text files that are being imported via SSIS into our database.These files come from hundreds of companies and a variety of different source systems. Most of these files are fine importing with code page 1252, but in some files, buried somewhere in one of the rows, there might be some oddball characters that don't fit in the 1252 code page.

I've implemented a solution based on this SO answer, which allows me to proceed with code page 1252 on one path if the file's encoding is ANSI/ASCII, OR it can go down another path with a 65001 code page. This seems to work in a lot of cases, but is not reliable enough to be something we could use in production.

using(var r = new StreamReader(filename, Encoding.Default))
{
    richtextBox1.Text = r.ReadToEnd();
    var encoding = r.CurrentEncoding;
}

I'm far from an expert on file encoding, but I'm guessing that it's because it only reads a certain portion of the file and if everything looks like ANSI characters, it will assume it is ANSI (these files are almost guaranteed not to have a BOM)?

Would I have to read the entire file into memory and examine every character to come to a mostly accurate file encoding? How can I do this when reading an extremely large file into memory would cause huge problems?

Is there a way to accomplish this with a reasonable level of certainty? I don't need to account for any kind of foreign languages as these are all English, but we've encountered the occasional strange character included in these files. I'm thinking we need to allow for ASCII, UTF-8 and UTF-16.

Is there a way to just be able to determine whether to use code page 1252 or 65001 in SSIS?

On a related note, if ASCII is a subset of UTF-8, why is it that when I import ALL the files as code page 65001, some of the characters don't translate correctly? Shouldn't UTF-8 work for everything if it encompasses ASCII?

Well, UTF-16 (/UCS-2) is easy to detect if the characters *"are all English"* - in that case, every second byte is zero. As far as I know, ASCII is a pure subset of UTF-8 (and, I believe, Windows 1252). Good luck trying to distinguish between UTF-8 and 1252 without a BOM — Flydog57, Apr 15 '20 at 14:36
I guess all I really need to do is be able to determine whether to apply a 1252 or 65001 code page to the file. Is there a way to reliably do this? — hyphen, Apr 15 '20 at 14:49

Determine File Encoding for Large Files in SSIS / C#

0 Answers0