1

We have some internal dashboards with PHP backend used for uploading CSV files. Recently we found some CSVs would fail to parse: the fgetcsv function returns false, which is super nasty since we couldn't determine the actual problem in CSV (for e.g. at which line no it is experience issues, which characters is it unable to digest etc.)


We narrowed down the problem to character-set encoding: CSVs generated from Windows machines were failing. Linux's iconv command was able to fix the CSVs for us

iconv -c --from-code=UTF-8 --to-code=ASCII path/to/uncleaned.csv > path/to/cleaned.csv

while it's PHP equivalent didn't work (tried using both //IGNORE//TRANSLIT options).

$uncleaned_csv_text = file_get_contents($source_data_csv_filename);
$cleaned_csv_text = iconv('UTF-8', 'ASCII/IGNORE//TRANSLIT', $uncleaned_csv_text);
file_put_contents($source_data_csv_filename, $cleaned_csv_text);
..
$headers = fgetcsv($source_data_csv_filename)

While we can use PHP's exec function to run the shell command

  • it is less than ideal
  • the practise is forbidden in our organisation from security viewpoint (Travis doesn't let it pass through)

Is there any alternative way to achieve this CSV 'cleaning'?


UPDATE-1

We explored several other options, none of which worked for us


UPDATE-2

  • Upon echoing the sha1 digest of CSV's text before and after subjecting it to PHP's iconv function, we found that iconv is not doing any change
  • Also in my case, mb_check_encoding on original CSV's text outputs true regardless of input query: windows-1252, ascii, utf-8
y2k-shubham
  • 10,183
  • 11
  • 55
  • 131
  • 1
    UTF-8 is an ASCII superset. Characters relevant in CSV have identical encoding in either charsets. The only difference between the iconv command and library is that the former supports Unicode BOM and will remove it. But as far as I know the only issue with BOM is that it'll show up among actual data. You'll have your reasons to give up on diagnosing the root issue but `fgetcsv()` isn't a black box. It only reads a line at a time so it should be easy to find out which line fails. And you can determine the exact encoding by inspecting the actual raw bytes in the file. – Álvaro González Apr 19 '20 at 11:11
  • 1
    "Caution If and how //TRANSLIT works exactly depends on the system's iconv() implementation (cf. ICONV_IMPL). Some implementations are known to ignore //TRANSLIT, so the conversion is likely to fail for characters which are illegal for the out_charset. " - From the manual: https://www.php.net/manual/en/function.iconv.php – bestprogrammerintheworld Apr 19 '20 at 11:21
  • Do you have the actual CSV-file (or parts of it) available for public view? – bestprogrammerintheworld Apr 20 '20 at 18:10
  • I've added an update to my solution which is a more complete solution which I hope you will find useful. – bestprogrammerintheworld Apr 21 '20 at 05:45

2 Answers2

1

I've been working on a plugin (in Wordpress) that handles csv-files for a while now (with a 5/5 star rating) and I've been using mb_convert_encoding() with no issues. I know I have users that uses both windows and Linux.

Basically: (TO UTF-8, FROM: Windows-1252)

$cleaned_csv_text = mb_convert_encoding($uncleaned_csv_text, 'UTF-8', 'Windows-1252');

If you don't know the original's encoding (maybe better in your case):

(TO UTF-8)

$cleaned_csv_text = mb_convert_encoding($uncleaned_csv_text, 'UTF-8');

UPDATE: Here is a more complete answer which I hope you will find useful: I've used file() together with str_getcsv() etc:

<?php
$file = "csvfiles/pricelist.csv"; //This is Windows-1252 encoded

//Load a csv file into an array $content_arr
$content_arr = array_map(function($v) {
    $delimiter = ';';
    return str_getcsv($v, $delimiter);},
    file( $file )); 

//Do encoding row by row
//and include end of line based on the item in the array $content_arr
$csv = array_map(function($v) {
    return mb_convert_encoding($v[0], 'UTF8','Windows-1252') . detect_eol($v[0]);},
    $content_arr);

//Save modified file in UTF8
file_put_contents('csvfiles/pricelist_modified.csv', $csv);

//Detects the end-of-line character of a string.
//
//function from 
//https://stackoverflow.com/questions/11066857/detect-eol-type-using-php/11066858#11066858
function detect_eol( $str )
{
    static $eols = array(
        "\0x000D000A", // [UNICODE] CR+LF: CR (U+000D) followed by LF (U+000A)
        "\0x000A",     // [UNICODE] LF: Line Feed, U+000A
        "\0x000B",     // [UNICODE] VT: Vertical Tab, U+000B
        "\0x000C",     // [UNICODE] FF: Form Feed, U+000C
        "\0x000D",     // [UNICODE] CR: Carriage Return, U+000D
        "\0x0085",     // [UNICODE] NEL: Next Line, U+0085
        "\0x2028",     // [UNICODE] LS: Line Separator, U+2028
        "\0x2029",     // [UNICODE] PS: Paragraph Separator, U+2029
        "\0x0D0A",     // [ASCII] CR+LF: Windows, TOPS-10, RT-11, CP/M, MP/M, DOS, Atari TOS, OS/2, Symbian OS, Palm OS
        "\0x0A0D",     // [ASCII] LF+CR: BBC Acorn, RISC OS spooled text output.
        "\0x0A",       // [ASCII] LF: Multics, Unix, Unix-like, BeOS, Amiga, RISC OS
        "\0x0D",       // [ASCII] CR: Commodore 8-bit, BBC Acorn, TRS-80, Apple II, Mac OS <=v9, OS-9
        "\0x1E",       // [ASCII] RS: QNX (pre-POSIX)
        "\0x15",       // [EBCDEIC] NEL: OS/390, OS/400
        "\r\n",
        "\r",
        "\n"
    );
    $cur_cnt = 0;
    $cur_eol = "\r\n"; //default

    //Check if eols in array above exists in string
    foreach($eols as $eol){     
        $char_cnt = mb_substr_count($str, $eol);

        if($char_cnt > $cur_cnt)
        {
            $cur_cnt = $char_cnt;
            $cur_eol = $eol;
        }
    }
    return $cur_eol;
}
bestprogrammerintheworld
  • 5,417
  • 7
  • 43
  • 72
  • If you omit `$from_encoding` it merely defaults to `mb_internal_encoding()`, which is likely to be `UTF-8` but may be something different. What's the exact intention of that piece of code? – Álvaro González Apr 20 '20 at 16:25
  • If your omit encoding it tries to detect the actual encoding of the string. I have never had and issue with that. The intention is if you dont know the originals files encoding it is possible to convert it to the characterset you want (UTF-8 in this case). But if your are sure about the original charset it is probably better to supply both from and to charset:) – bestprogrammerintheworld Apr 20 '20 at 17:37
-2

This can't be deemed as a solution (since we didn't even determine the root cause of the problems), but rather a hack.


We asked people using Windows machines to do this for uploading CSVs

  • Upload CSV to Google Sheet (File > Import) enter image description here
  • Download the created Sheet back as a CSV (File > Download > Comma-separated values) enter image description here
  • Then use that downloaded CSV for uploading on dashboard

Credits to @RohitChandana for suggesting this workaround

y2k-shubham
  • 10,183
  • 11
  • 55
  • 131