
I have a file input in my ASP.NET application. The user submits a CSV file to update the database.

This CSV file is created by exporting an .xlsx file.
The .xlsx file contains non-ASCII characters, such as França, Rússia, etc.
The user sometimes incorrectly saves it as "CSV (MS-DOS)" (which writes an ASCII-based format) instead of "CSV (comma separated values)" (which preserves the .xlsx encoding).

So, I want to validate the file encoding before writing its content to the database...

How can I safely detect the encoding of a file submitted in .NET?

ps.: BOM verification is not enough. A file can be UTF-8 without a BOM.

  • I don't think it's a duplicate, he doesn't want to detect the code page, he wants to detect whether the file has been saved properly. – zmbq Dec 16 '13 at 20:25
  • @Andre, create an excel file, put some non-ASCII characters in it and save as an MS-DOS CSV file. What becomes of the non-ASCII characters? Are those question marks? – zmbq Dec 16 '13 at 20:26
  • @zmbq: the question asks: "How can I safely detect file encoding of a file [submitted in .net]?" That is a duplicate question. Encodings are implemented as codepages in Windows. Detecting a file's encoding/codepage, regardless of how the file was created, is not 100% reliable if the encoding/codepage is not specified in the file itself or in its metadata, you have to prompt the user for it. – Remy Lebeau Dec 17 '13 at 02:07

1 Answer


How can I safely detect the encoding of a file submitted in .NET?

You can't.

Excel's "CSV" saving comes out in the machine's ANSI code page, and "CSV (MS-DOS)" comes out in the OEM code page. Both these encodings vary from machine to machine and they're never anything useful like UTF-8 or UTF-16. (Indeed, on some East Asian machines, they may not even be fully ASCII-compatible.)

You might be able to take a guess based on heuristics. For example, if França is a common value in the documents you handle, you could look for its byte patterns in the likely encodings:

                                                    F  r  a  n  ç  a
Code page 1252 (ANSI on Western European machines): 46 72 61 6e e7 61
Code page 850  (OEM  on Western European Machines): 46 72 61 6e 87 61
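
A minimal sketch of such a heuristic, assuming França reliably appears somewhere in the uploaded bytes (the byte values are the ones from the table above; EncodingGuesser and GuessFromKnownWord are just illustrative names):

    using System.Text;

    static class EncodingGuesser
    {
        // Bytes for "França" in the two candidate code pages (see table above).
        static readonly byte[] Cp1252Franca = { 0x46, 0x72, 0x61, 0x6E, 0xE7, 0x61 };
        static readonly byte[] Cp850Franca  = { 0x46, 0x72, 0x61, 0x6E, 0x87, 0x61 };

        // Returns a guessed Encoding, or null if neither pattern is present.
        public static Encoding GuessFromKnownWord(byte[] fileBytes)
        {
            if (Contains(fileBytes, Cp1252Franca))
                return Encoding.GetEncoding(1252); // ANSI "CSV"
            if (Contains(fileBytes, Cp850Franca))
                return Encoding.GetEncoding(850);  // OEM "CSV (MS-DOS)"
            return null;
        }

        // Simple byte-sequence search; fine for short patterns like this one.
        static bool Contains(byte[] haystack, byte[] needle)
        {
            for (int i = 0; i <= haystack.Length - needle.Length; i++)
            {
                int j = 0;
                while (j < needle.Length && haystack[i + j] == needle[j]) j++;
                if (j == needle.Length) return true;
            }
            return false;
        }
    }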

If you don't have any constant patterns like that, the best you can do is arbitrary guessing (see this question). Either way, it hardly qualifies as 'safe'.

CSV as a format has no mechanism for declaring its encoding, and there is no de facto standard of just using UTF-8. So it can't really be used to transfer non-ASCII text with any degree of reliability.

An alternative you could look at would be to encourage your users to save from Excel as "Unicode text". This will give you a .txt file in the UTF-16LE encoding (Encoding.Unicode in .NET terms), which you can easily detect from the BOM. The content is TSV, so same quoting rules as CSV but with tab separators.
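
A sketch of that check, assuming the upload is available as a seekable Stream (e.g. HttpPostedFile.InputStream). FF FE is the UTF-16LE BOM, and a StreamReader created with Encoding.Unicode will skip it automatically:

    using System.Collections.Generic;
    using System.IO;
    using System.Text;

    static class UnicodeTextUpload
    {
        // Returns true if the stream starts with the UTF-16LE BOM (FF FE).
        public static bool LooksLikeUtf16Le(Stream stream)
        {
            var bom = new byte[2];
            int read = stream.Read(bom, 0, 2);
            stream.Position = 0; // rewind so the caller can re-read the content
            return read == 2 && bom[0] == 0xFF && bom[1] == 0xFE;
        }

        // Reads the tab-separated content once the encoding has been confirmed.
        public static List<string[]> ReadUnicodeTsv(Stream stream)
        {
            var rows = new List<string[]>();
            using (var reader = new StreamReader(stream, Encoding.Unicode))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                    rows.Add(line.Split('\t')); // naive split; quoted fields need proper TSV parsing
            }
            return rows;
        }
    }

You would reject the upload (and tell the user how to re-save the file) whenever LooksLikeUtf16Le returns false.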
