
I have a file input in my ASP.NET application. The user submits a CSV file to update the database.

This CSV file is created by exporting an .xlsx file.
The .xlsx file contains non-ASCII characters, such as França, Rússia, etc.
The user sometimes incorrectly saves it as "CSV (MS-DOS)" (which writes an ASCII-based format) instead of "CSV (comma separated values)" (which preserves the .xlsx encoding).

So, I want to validate the file encoding before writing its content to the database...

How can I safely detect the encoding of a file submitted in .NET?

ps.: BOM verification is not enough. A file can be UTF-8 without a BOM.

  • I don't think it's a duplicate, he doesn't want to detect the code page, he wants to detect whether the file has been saved properly. – zmbq Dec 16 '13 at 20:25
  • @Andre, create an excel file, put some non-ASCII characters in it and save as an MS-DOS CSV file. What becomes of the non-ASCII characters? Are those question marks? – zmbq Dec 16 '13 at 20:26
  • @zmbq: the question asks: "How can I safely detect file encoding of a file [submitted in .net]?" That is a duplicate question. Encodings are implemented as codepages in Windows. Detecting a file's encoding/codepage, regardless of how the file was created, is not 100% reliable if the encoding/codepage is not specified in the file itself or in its metadata, you have to prompt the user for it. – Remy Lebeau Dec 17 '13 at 02:07

1 Answer


How can I safely detect the encoding of a file submitted in .NET?

You can't.

Excel's "CSV" saving comes out in the machine's ANSI code page, and "CSV (MS-DOS)" comes out in the OEM code page. Both these encodings vary from machine to machine and they're never anything useful like UTF-8 or UTF-16. (Indeed, on some East Asian machines, they may not even be fully ASCII-compatible.)

You might be able to take a guess based on heuristics. For example, if França is a common value in the documents you handle, you could look for its byte patterns in the likely encodings:

                                                    F  r  a  n  ç  a
Code page 1252 (ANSI on Western European machines): 46 72 61 6e e7 61
Code page 850  (OEM  on Western European Machines): 46 72 61 6e 87 61
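
A minimal sketch of such a heuristic, assuming França reliably appears somewhere in the uploaded bytes (the byte values are the ones from the table above; EncodingGuesser and GuessFromKnownWord are just illustrative names):

    using System.Text;

    static class EncodingGuesser
    {
        // Bytes for "França" in the two candidate code pages (see table above).
        static readonly byte[] Cp1252Franca = { 0x46, 0x72, 0x61, 0x6E, 0xE7, 0x61 };
        static readonly byte[] Cp850Franca  = { 0x46, 0x72, 0x61, 0x6E, 0x87, 0x61 };

        // Returns a guessed Encoding, or null if neither pattern is present.
        public static Encoding GuessFromKnownWord(byte[] fileBytes)
        {
            if (Contains(fileBytes, Cp1252Franca))
                return Encoding.GetEncoding(1252); // ANSI "CSV"
            if (Contains(fileBytes, Cp850Franca))
                return Encoding.GetEncoding(850);  // OEM "CSV (MS-DOS)"
            return null;
        }

        // Simple byte-sequence search; fine for short patterns like this one.
        static bool Contains(byte[] haystack, byte[] needle)
        {
            for (int i = 0; i <= haystack.Length - needle.Length; i++)
            {
                int j = 0;
                while (j < needle.Length && haystack[i + j] == needle[j]) j++;
                if (j == needle.Length) return true;
            }
            return false;
        }
    }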

If you don't have any constant patterns like that, the best you can do is arbitrary guessing (see this question). Either way, it hardly qualifies as 'safe'.

CSV as a format has no mechanism for declaring its encoding, and there is no de facto standard of just using UTF-8. So it can't really be used to transfer non-ASCII text with any degree of reliability.

An alternative you could look at would be to encourage your users to save from Excel as "Unicode text". This will give you a .txt file in the UTF-16LE encoding (Encoding.Unicode in .NET terms), which you can easily detect from the BOM. The content is TSV, so same quoting rules as CSV but with tab separators.
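
A sketch of that check, assuming the upload is available as a seekable Stream (e.g. HttpPostedFile.InputStream). FF FE is the UTF-16LE BOM, and a StreamReader created with Encoding.Unicode will skip it automatically:

    using System.Collections.Generic;
    using System.IO;
    using System.Text;

    static class UnicodeTextUpload
    {
        // Returns true if the stream starts with the UTF-16LE BOM (FF FE).
        public static bool LooksLikeUtf16Le(Stream stream)
        {
            var bom = new byte[2];
            int read = stream.Read(bom, 0, 2);
            stream.Position = 0; // rewind so the caller can re-read the content
            return read == 2 && bom[0] == 0xFF && bom[1] == 0xFE;
        }

        // Reads the tab-separated content once the encoding has been confirmed.
        public static List<string[]> ReadUnicodeTsv(Stream stream)
        {
            var rows = new List<string[]>();
            using (var reader = new StreamReader(stream, Encoding.Unicode))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                    rows.Add(line.Split('\t')); // naive split; quoted fields need proper TSV parsing
            }
            return rows;
        }
    }

You would reject the upload (and tell the user how to re-save the file) whenever LooksLikeUtf16Le returns false.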
