I have a PHP application which generates a simple CSV file using league/csv. Some of the columns contain names/addresses which might have non-ASCII characters. My client requires that the output CSV file be encoded in ISO-8859-1 instead of UTF-8, as it currently is.

I believe my problem can be reduced to the following (where `response()` is the Laravel helper):


        $headers = [
            'Content-type' => "text/csv; charset=iso-8859-1",
            'Content-Disposition' => 'attachment; filename="CLI.csv"'
        ];
        return response()->stream(function() {

            $fh = fopen('php://output', 'wb');
            fwrite($fh, "Vià Cittè\n");
            fwrite($fh, mb_convert_encoding("Vià Cittè\n", 'iso-8859-1'));
            fwrite($fh, mb_convert_encoding("Vià Cittè\n", 'iso-8859-1', 'utf-8'));
            fwrite($fh, iconv('utf-8', 'iso-8859-1', "Vià Cittè\n"));
            fwrite($fh, utf8_decode("Vià Cittè\n"));
            fwrite($fh, utf8_encode("Vià Cittè\n"));
            fclose($fh);

        }, 200, $headers);

I would expect at least some of the lines to be Vià Cittè\n encoded in iso-8859-1 but they all end up wrong. This is what I see when I open the output file using iso-8859-1 as encoding:

[Image: screenshot of the output file opened as ISO-8859-1]

It appears that the output gets re-encoded as UTF-8 for some reason.

Can someone tell me how I can avoid this re-encoding?


In my real code I'm not writing directly using fopen, I use league/csv with its Writer and CharsetConverter. I have made various attempts but the result is the same as described above.

Note: I'm currently using PHP 7.3 on Linux. The PHP server runs inside a Docker container behind an nginx proxy (which is in a different Docker container).

GACy20
  • You can't trust your editor's rendering when (intentionally) mixing encodings. I suggest you test with only a couple of characters and a [hexadecimal editor](https://hexed.it/). Also, I advise reading the manual to see what the functions do, since some names and signatures are very misleading. – Álvaro González Jan 11 '22 at 17:25
  • In **2022**: read and follow [UTF-8 Everywhere](https://utf8everywhere.org/) and [UTF-8 all the way through](https://stackoverflow.com/questions/279170/) – JosefZ Jan 11 '22 at 17:28
  • @JosefZ Unfortunately the issue is that the tool that uses the files we export cannot handle utf-8, so that's not an option. – GACy20 Jan 11 '22 at 19:18

1 Answer

You have several valid conversions, together with obvious random attempts. It's all a matter of doing some proper testing.

| Raw | Unicode | UTF-8 | ISO-8859-1 |
|-----|---------|-------|------------|
| à | U+00E0 LATIN SMALL LETTER A WITH GRAVE | C3 A0 | E0 |
| è | U+00E8 LATIN SMALL LETTER E WITH GRAVE | C3 A8 | E8 |

    $utf8 = "\u{00E0}\u{00E8}";
    var_dump($utf8, bin2hex($utf8));

    $latin1 = [
        utf8_decode($utf8),
        iconv('UTF-8', 'ISO-8859-1', $utf8),
        mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8'),
    ];
    var_dump(array_map('bin2hex', $latin1));

Assuming everything is configured to use UTF-8 (we aren't cavemen living in 1995) you'll see:

    string(4) "àè"
    string(8) "c3a0c3a8"
    array(3) {
      [0]=>
      string(4) "e0e8"
      [1]=>
      string(4) "e0e8"
      [2]=>
      string(4) "e0e8"
    }
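
To rule out the write itself as the step that changes the encoding, the same kind of check can be applied to what a stream actually receives. This is only a minimal sketch: it assumes the mbstring extension is loaded, and it uses an in-memory `php://memory` stream as a stand-in for `php://output`, which is my assumption and not the setup from the question.

```php
<?php
// Build the input from escapes so the source file's own encoding cannot interfere.
$utf8 = "Vi\u{00E0} Citt\u{00E8}\n";
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

// php://memory stands in for php://output (an assumption of this sketch).
$fh = fopen('php://memory', 'w+b');
fwrite($fh, $latin1);
rewind($fh);
$written = stream_get_contents($fh);
fclose($fh);

// Hex dump of the bytes the stream really holds.
$hex = bin2hex($written);
echo $hex, "\n";
```

If the dump shows `e0` and `e8` as single bytes, PHP wrote ISO-8859-1, and any UTF-8 seen later would have to come from something after the write (a proxy, a middleware, or the tool used to inspect the file).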

I'd skip utf8_decode() because of its extremely confusing name (nobody checks the manual to see what it actually does). The other ones mainly differ in how they handle characters that have no ISO-8859-1 equivalent:

    $utf8 = "€";
    var_dump($utf8);

    $latin1 = [
        iconv('UTF-8', 'ISO-8859-1', $utf8), # Notice: iconv(): Detected an illegal character in input string
        iconv('UTF-8', 'ISO-8859-1//IGNORE', $utf8),
        iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $utf8),
        mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8'),
    ];
    var_dump(array_map('bin2hex', $latin1));

This prints:

    string(3) "€"
    array(4) {
      [0]=>
      string(0) ""
      [1]=>
      string(0) ""
      [2]=>
      string(6) "455552" ------> EUR
      [3]=>
      string(2) "3f" ----------> ?
    }
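
If silently losing or replacing characters is not acceptable for the export, one option is to fail loudly instead. The sketch below is a hypothetical helper (`toLatin1Strict` is a name I made up; it is not part of PHP or league/csv) that assumes mbstring is available and uses a round-trip comparison to detect characters that did not survive the conversion.

```php
<?php
/**
 * Hypothetical helper: converts UTF-8 to ISO-8859-1 and throws instead of
 * silently dropping or substituting characters.
 */
function toLatin1Strict(string $utf8): string
{
    if (!mb_check_encoding($utf8, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    $latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

    // Round trip: if converting back does not reproduce the input,
    // at least one character had no ISO-8859-1 equivalent.
    if (mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1') !== $utf8) {
        throw new UnexpectedValueException('Input has characters outside ISO-8859-1');
    }

    return $latin1;
}

// Representable characters convert normally.
$ok = bin2hex(toLatin1Strict("\u{00E0}\u{00E8}"));
echo $ok, "\n";

// The euro sign (U+20AC) is not in ISO-8859-1, so the helper throws.
try {
    toLatin1Strict("\u{20AC}");
    $failed = false;
} catch (UnexpectedValueException $e) {
    $failed = true;
}
var_dump($failed);
```

Whether throwing, substituting `?`, or transliterating is the right policy depends on what the consuming tool tolerates; the helper only makes the choice explicit.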
Álvaro González
  • Doesn't work at all. `$utf8 = "Vià Cittè"; fwrite($fh, utf8_decode($utf8))` still returns utf-8 encoded text, same using `iconv` or `mb_convert_encoding` as described. Opening the downloaded file using python shows it is utf-8 and not latin1: `Vi\xc3\xa0 Citt\xc3\xa8\n` . I want the resulting file to be `Vi\xe0 Citt\xe8\n` – GACy20 Jan 12 '22 at 10:36
  • When I log the three solutions you proposed using `bin2hex`, only the `utf8_decode` one produces the correct byte sequence: `5669e02043697474e80a`. But this doesn't matter, because the encoding changes to utf-8 on download. That's what I'm trying to fix. My nginx does not have any `charset` option defined for the server, so it shouldn't touch the encoding of the responses; the headers say `Content-Type: text/csv; charset=iso-8859-1` and no other header specifies a character encoding. – GACy20 Jan 12 '22 at 10:47
  • Let's address one thing at a time. 1) `$utf8 = "Vià Cittè"` requires your IDE to be configured to save the file as UTF-8; check the hex dump to be sure, or use `\u{00E0}` notation. 2) Can you reproduce the `utf8_decode()` / `mb_convert_encoding()` / `iconv()` mismatch in a standalone snippet with verified input and hex output? – Álvaro González Jan 12 '22 at 11:51
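
A standalone input check along the lines requested in the last comment might look like the sketch below. `describeBytes` is a hypothetical helper name, mbstring is assumed to be loaded, and both test strings are built from escapes so that the encoding of the script file itself plays no part.

```php
<?php
// Hypothetical helper: reports the raw bytes of a string and whether
// those bytes form valid UTF-8.
function describeBytes(string $s): array
{
    return [
        'hex'        => bin2hex($s),
        'valid_utf8' => mb_check_encoding($s, 'UTF-8'),
    ];
}

$fromEscapes = describeBytes("Vi\u{00E0} Citt\u{00E8}"); // known UTF-8 input
$rawLatin1   = describeBytes("Vi\xE0 Citt\xE8");         // known ISO-8859-1 bytes

var_dump($fromEscapes, $rawLatin1);
```

Running the same helper on a string literal typed directly into the source file shows which of the two cases that literal falls into before any conversion function is called.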