2

I am a bit of a newbie with encoding. I have a php file that allows users to upload csv files.

My issue is that when the file is created using excel for mac if the file contains utf-8 characters such as accented letters, my code will not work properly. Basically it will ignore the accented characters.

The problem occurs only when the file is saved using the Comma separated values option.

In all the other cases such as file made in windows or using open office or even excel on mac but saving them as 'windows' file do not cause any problem.

mb_detect_encoding returns false for the file causing troubles.

here is the code:

// say there is the word Nestlé in the file
$content = file_get_contents(addslashes($file_name));

var_dump(mb_detect_encoding($content)); // print false


$data  = mb_convert_encoding($content, 'UTF-8', mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));
            //$data  = utf8_encode($content);  //doesn't work

var_dump($data); // print Nestl

ini_set('auto_detect_line_endings',TRUE);

// more code here we don't need at the moment

This question gave me some indications: file_get_contents() Breaks Up UTF-8 Characters

Any help or idea on how to solve this problem? Thank you in advance

here is the new bit of code after the response posted by Anthony

$content = file_get_contents(addslashes($file_name));
// i have no control on how the file is generated so i need to to the replace in the code
$content = str_replace(",", "\t",  $content);
var_dump($content);
$data  = mb_convert_encoding($content, 'UTF-8', mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));


$data =  mb_convert_encoding($data, 'UTF-16LE', 'UTF-8');
$data = chr(255) . chr(254) . $data;

var_dump($data); // this still print funny characters not the accented letter

Am i doing something wrong?

Community
  • 1
  • 1
mikey
  • 1,339
  • 5
  • 22
  • 43
  • 1
    This is known issue for excel on osx, the lack of Unicode support for csv files. There's a workaround, I'll post in a few. You aren't doing another wrong, it's excel – Anthony Jan 29 '15 at 03:35
  • @Anthony that would be a nice birthday present for me :) – mikey Jan 29 '15 at 03:44

2 Answers2

2

This is an issue specific to Excel and more common on Excel for Mac, where UTF-8 multibyte characters are not properly displayed. You can confirm by using a different spreadsheet viewer, like Google Sheets.

The workaround for this is:

  1. Use tabs (\t) instead of commas as delimiter (don't worry, it's still technically a CSV).

  2. After encoding to utf-8, convert the entire csv string to UTF-16LE :

    mb_convert_encoding($csv_content, 'UTF-16LE', 'UTF-8');

  3. Prefix the csv string with a little-endian byte-order-mark (LE BOM) :

    $csv_content = chr(255) . chr(254) . $csv_content;

And this should do it.

Anthony
  • 36,459
  • 25
  • 97
  • 163
  • see my edit, i can't get your workaround to work, can you spot anything wrong in my code? maybe it's mine line where i convert to utf-8 that is not working – mikey Jan 29 '15 at 04:25
  • 1
    I actually got your issue backwards. My solution is for outputting a csv that will work for Excel on Mac, not for importing. Give me a few minutes to see if the solution works in reverse. – Anthony Jan 29 '15 at 04:31
  • 1
    Hmmmm, I created a test csv using your example, Nestlé. It looks like this is encoded using Mac OS Roman, which requires `iconv` to convert to utf-8: http://stackoverflow.com/questions/4722864/how-can-i-convert-western-mac-os-roman-formatted-text-to-utf-8-with-php – Anthony Jan 29 '15 at 04:41
  • 1
    My guess is that characters that are within Mac OS Roman code block (like é) are encoded using that character set, but likely if characters are not in that set (like Japanese characters) are encoded using UTF-16 or some other encoding. So you should use the `mb_detect_encoding` and if it comes back false then assume it's Mac OS Roman. – Anthony Jan 29 '15 at 04:44
  • that is what i have just finished doing! thank you very much for helping me out! – mikey Jan 29 '15 at 04:57
0

All right, thank you Anthony, here is the line that will fix it:

$data = iconv('macintosh', 'UTF-8', $content);

so my final code will look something like this:

enter code here

$content = file_get_contents(addslashes($file_name));

var_dump(mb_detect_encoding($content));
// need to do this for an issue specific to Excel and more common on Excel for Mac
// using excel on mac if the file is saved as csv using the Comma separated values option we need to use iconv and not mb_convert_encoding
// we use mb_detect_encoding because the content of such file returns a false value
if(!mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true)){
     //$data  = mb_convert_encoding($content, 'UTF-8', mb_detect_encoding($content, 'UTF-8, ISO-8859-1', 'macintosh', true));

      $data = iconv('macintosh', 'UTF-8', $content);


 } 
    // deal with known encoding types
 else{
         $data  = mb_convert_encoding($content, 'UTF-8', mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));
 }
mikey
  • 1,339
  • 5
  • 22
  • 43