
I have a simple script that accepts a CSV file and reads every row into an array. I then loop through each column of the first row (which, in my case, holds the questions of a survey) and print them out. The survey is in French, and whenever the first character of a question is a special character (é, ê, ç, etc.), fgetcsv() simply omits it.

Special characters in the middle of a value are not affected; they are dropped only when they are the first character.

I tried to debug this but I am baffled. I did a var_dump of the file's contents, and the characters are definitely there:

var_dump(utf8_encode(file_get_contents($_FILES['csv_file']['tmp_name'])));

And here's my code:

if (file_exists($_FILES['csv_file']['tmp_name']) && $csv = fopen($_FILES['csv_file']['tmp_name'], "r"))
{
    $csv_arr = array();

    // Populate an array with all the rows of the CSV file.
    // Looping on fgetcsv() itself avoids appending a trailing false at EOF.
    while (($row = fgetcsv($csv)) !== false)
    {
        $csv_arr[] = $row;
    }

    // Close the file, no longer needed
    fclose($csv);

    // This should cycle through the cells of the first row (questions)
    foreach ($csv_arr[0] as $question)
    {
        echo utf8_encode($question) . "<br />";
    }
}
Lightness Races in Orbit
Gazillion
  • fgetcsv() is only binary-safe if you use plain ASCII - in other words, not at all. See http://stackoverflow.com/questions/3637770/why-fgetcsv-drops-some-characters-with-diacritics - basically, use fgets() to read the data, then parse CSV using a custom function. Apparently this also works: http://stackoverflow.com/questions/1472886/some-characters-in-csv-file-are-not-read-during-php-fgetcsv – Piskvor left the building Sep 03 '10 at 17:08

4 Answers

Are you setting your locale correctly before calling fgetcsv()?

setlocale(LC_ALL, 'fr_FR.UTF-8');

Otherwise, fgetcsv() is not multi-byte safe.

Make sure that you set it to something that appears in your list of available locales. On Linux (certainly on Debian) you can see this list by running

locale -a

You should get something like...

C
en_US.utf8
POSIX

For UTF-8 support, pick a locale with utf8 on the end. If your input is encoded with something else, you'll need to use the appropriate locale, but make sure your OS supports it first.

If you set the locale to one which isn't available on your system, it won't help you: setlocale() returns false in that case and the locale stays unchanged.
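
The advice above can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name `read_csv_rows` and the list of candidate locale names are hypothetical and not part of the original answer, and the names that actually exist on a given machine should be taken from `locale -a`. The delimiter, enclosure and escape arguments are passed explicitly and simply repeat fgetcsv()'s traditional defaults.

```php
<?php
// Hypothetical helper (not from the original answer): switch LC_CTYPE to a
// UTF-8 locale, read every CSV row, then restore the previous locale.
// The candidate names are assumptions; compare them with `locale -a`.
function read_csv_rows($path, array $candidates = array('fr_FR.UTF-8', 'fr_FR.utf8', 'en_US.UTF-8', 'en_US.utf8'))
{
    // Passing "0" only queries the current setting without changing it.
    $previous = setlocale(LC_CTYPE, '0');

    // setlocale() tries each candidate in turn and returns the first one
    // that could be set, or false if none of them is installed.
    $chosen = setlocale(LC_CTYPE, $candidates);

    $rows = array();
    $handle = fopen($path, 'r');
    if ($handle !== false) {
        // Loop on fgetcsv() itself so no trailing false is stored at EOF.
        while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
            $rows[] = $row;
        }
        fclose($handle);
    }

    // Put the locale back so the rest of the script is unaffected.
    if ($previous !== false) {
        setlocale(LC_CTYPE, $previous);
    }

    // 'locale' is false when no candidate was available, so the caller
    // can tell that multi-byte fields may not have been read correctly.
    return array('locale' => $chosen, 'rows' => $rows);
}
```

Checking the returned `'locale'` value for `false` is what tells you whether the locale change took effect at all.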

t0mm13b
Brock Batsell
  • Sorry if I come off as ignorant, but what is mb-safe? I added the line, with no effect on the behaviour of my script. The manual says that the function is binary safe since PHP 4.3.5 (we have PHP 5 installed) – Gazillion Feb 10 '10 at 17:53
  • Multi Byte Safe = able to handle encodings in which a single character can consist of more than one byte (e.g. UTF-8). – Pekka Feb 10 '10 at 18:01
  • This resolves the problem for me as long as the input is UTF-8, but the problem persists for other 8-bit encodings. – eswald Jan 26 '12 at 20:05
  • Great answer - are there any drawbacks in setting locale to a UTF-8 encoding on a whole project instead of just for `fgetcsv()`? – Horen May 16 '13 at 08:29

This behaviour has a bug report filed for it, but apparently it isn't a bug.

David Johnstone

Have you already checked out the manual page on fgetcsv? There is nothing about that specific problem offhand, but a number of contributions may be worth looking through if nothing comes up here.

There's this, for example:

Note: Locale setting is taken into account by this function. If LANG is e.g. en_US.UTF-8, files in one-byte encoding are read wrong by this function.

Also, seeing as it's always in the beginning of the line, could it be that this is really a hidden line break problem? There's this:

Note: If PHP is not properly recognizing the line endings when reading files either on or created by a Macintosh computer, enabling the auto_detect_line_endings run-time configuration option may help resolve the problem.

You may also want to try saving the file with different line endings.
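
If the file turns out to be in a one-byte encoding, one possible approach is to convert it to UTF-8 and normalise the line endings before parsing. The sketch below is an assumption-laden illustration, not something from the manual notes quoted above: the helper name `csv_bytes_to_utf8` is hypothetical, the default source encoding of ISO-8859-1 is a guess about the input, and it relies on the iconv extension being available.

```php
<?php
// Hypothetical helper (not from the original answer): convert raw CSV bytes
// from a single-byte encoding to UTF-8 and normalise line endings.
// The default source encoding is an assumption about the input file.
function csv_bytes_to_utf8($raw, $sourceEncoding = 'ISO-8859-1')
{
    // iconv() returns false if the conversion fails.
    $utf8 = iconv($sourceEncoding, 'UTF-8', $raw);
    if ($utf8 === false) {
        return false;
    }

    // Turn Windows (\r\n) and old Mac (\r) line endings into \n.
    // Note: this also rewrites line breaks inside quoted fields.
    return str_replace(array("\r\n", "\r"), "\n", $utf8);
}
```

The converted string could then be written to a `php://temp` stream and read with fgetcsv(), ideally with a UTF-8 locale in effect.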

Pekka
  • I've read the manual page on how to use the function and a quick search through the comment area didn't pop up anything for special characters or utf-8 encoding. I had noticed that it could have trouble with UTF-8 encoding but if I don't encode the values the value still doesn't show up. I'm not sure if there would be another way to get around this. I tried using "|" as an end of line delimiter and I get the same problem. This is very confusing :) – Gazillion Feb 10 '10 at 17:49

We saw the same result with LANG set to C, and worked around it by ensuring that such values were wrapped in quotation marks. For example, the line

a,"a",é,"é",óú,"óú",ó&ú,"ó&ú"

generates the following array when passed through fgetcsv():

array (
  0 => 'a',
  1 => 'a',
  2 => '',
  3 => 'é',
  4 => '',
  5 => 'óú',
  6 => '&ú',
  7 => 'ó&ú',
)

Of course, you'll have to escape any quotation marks in the value by doubling them, but that's much less hassle than repairing the missing characters.

Oddly, this happens with both UTF-8 and cp1252 encodings for the input file.
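
If you control how the CSV is produced, the quoting workaround might look something like this. It is a minimal sketch: the helper name `csv_line_all_quoted` is hypothetical, and it is hand-rolled rather than built on fputcsv() because fputcsv() only adds quotes when it considers them necessary.

```php
<?php
// Hypothetical helper (not from the original answer): build one CSV line
// with every field wrapped in double quotes, doubling embedded quotes.
function csv_line_all_quoted(array $fields, $delimiter = ',')
{
    $quoted = array();
    foreach ($fields as $field) {
        // A literal " inside a value is escaped by writing it twice.
        $quoted[] = '"' . str_replace('"', '""', (string) $field) . '"';
    }
    return implode($delimiter, $quoted) . "\n";
}
```

For example, `csv_line_all_quoted(array('é', 'ó&ú'))` produces the line `"é","ó&ú"`, the quoted form that was read back intact in the test above.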

eswald