
On a Linux server, if a user uploads a CSV file created in MS Office Excel (and therefore encoded as Windows-1250, a.k.a. cp1250), every method of detecting the file encoding known to me incorrectly returns ISO-8859-1 (a.k.a. Latin-1).

This is crucial, because the detected encoding drives the conversion to the final UTF-8.

Methods I tried:

  • CLI
    • `file -i [FILE]` returns `iso-8859-1`
    • `file -b [FILE]` returns `iso-8859-1`
  • Vim
    • `vim [FILE]`, then `:set fileencoding?` returns `latin1`
  • PHP
    • `mb_detect_encoding(file_get_contents($filename))` returns (surprisingly) `UTF-8`
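The misdetection can be reproduced with a synthetic file (a sketch with an invented file path; the exact `file` output may vary between versions). The underlying problem is that Windows-1250 and ISO-8859-1 assign characters to the same single-byte range, so a byte like `0xE8` ("č" in cp1250, "è" in Latin-1) gives a byte-level sniffer no evidence to prefer one over the other:

```shell
# Write a short CSV containing Czech text encoded as Windows-1250.
# 0xE9 = "é", 0xEC = "ě", 0xE1 = "á", 0x9A = "š", 0xF2 = "ň" in cp1250,
# yet every one of these byte values is also legal in ISO-8859-1
# (0x9A merely lands in its C1 control range), so byte sniffing
# cannot distinguish the two encodings.
printf 'jm\xE9no;m\xECsto\nLuk\xE1\x9A;Plze\xF2\n' > /tmp/test.csv
file -bi /tmp/test.csv
```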

Yet the file is indeed in Windows-1250, as can be proved e.g. by opening the CSV file in LibreOffice Calc: the import dialog asks for the file encoding, and selecting either ISO-8859-1 or UTF-8 results in wrongly displayed characters, while selecting ASCII displays all characters correctly!

How can I correctly detect the file encoding on a Linux (Ubuntu) server, ideally with default Ubuntu utilities or with PHP?
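Since the legacy 8-bit encodings are byte-for-byte ambiguous, one workable approach is to validate rather than detect. This is only a sketch built on an assumption (any upload that is not valid UTF-8 is treated as Windows-1250, which matches the Excel-dominated uploads described here, but is not real detection); the paths are illustrative. Multi-byte UTF-8 sequences have a strict structure, so a cp1250 file containing accented characters almost never passes UTF-8 validation:

```shell
# Sketch: accept the upload as UTF-8 only if it round-trips through
# iconv's UTF-8 validator; otherwise assume Windows-1250 and convert.
# iconv exits non-zero when the input contains invalid sequences.
f=/tmp/upload.csv
if iconv -f UTF-8 -t UTF-8 "$f" >/dev/null 2>&1; then
    enc=UTF-8
else
    enc=WINDOWS-1250
fi
iconv -f "$enc" -t UTF-8 "$f" > /tmp/upload-utf8.csv
echo "converted from $enc"
```

The same check can be done in PHP with `mb_check_encoding($data, 'UTF-8')` before falling back to `iconv('WINDOWS-1250', 'UTF-8', $data)`.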

The last option I can think of is to detect the user agent (and thus the user's OS) during upload, and if it is Windows, automatically assume the encoding is ASCII...

shadyyx
  • Possible duplicate of [How can I detect the encoding/codepage of a text file](http://stackoverflow.com/questions/90838/how-can-i-detect-the-encoding-codepage-of-a-text-file) – JiriS Jan 20 '16 at 12:52
  • You need to tell the user which character set and encoding to use or let them tell you which the file uses. Anything else is incomplete communication and is therefore data loss. How would you know that it's not CP437 or ISO-8859-1? (If they are using Excel, why not let them upload an Excel file or HTML or Office 2003 XML? These would tell you internally which encoding is used.) – Tom Blodget Jan 21 '16 at 00:34
  • @TomBlodget Users can only upload CSV files, but 99.99% of them are created in MS Excel. I just cannot rely on one encoding, as there still is that 0.01%. – shadyyx Jan 21 '16 at 09:17

0 Answers