We have some internal dashboards with a PHP backend used for uploading CSV files. Recently we found that some CSVs fail to parse: the fgetcsv function returns false, which is nasty because we can't determine the actual problem in the CSV (e.g. on which line number it hits an issue, or which characters it is unable to digest).
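Since fgetcsv itself gives no hint about where it fails, one way to locate suspicious input is to scan the raw file line by line before parsing. The following is only a sketch: the function name find_non_ascii_lines is made up for illustration, and it assumes the mbstring extension is available and that "suspicious" means "contains bytes outside 7-bit ASCII".

```php
<?php
// Hypothetical helper (not part of PHP): list the lines of a file that
// contain bytes >= 0x80, with the offending bytes shown as hex.
function find_non_ascii_lines(string $path): array
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }

    $report = [];
    $line_no = 0;
    while (($line = fgets($handle)) !== false) {
        $line_no++;
        // No /u modifier, so the pattern matches single bytes, not characters.
        if (preg_match_all('/[\x80-\xFF]/', $line, $matches) > 0) {
            $report[] = [
                'line'       => $line_no,
                // false means the line is not even well-formed UTF-8
                'valid_utf8' => mb_check_encoding($line, 'UTF-8'),
                'bytes'      => implode(' ', array_map('bin2hex', $matches[0])),
            ];
        }
    }
    fclose($handle);

    return $report;
}
```

Calling print_r(find_non_ascii_lines($source_data_csv_filename)) would then show each affected line number together with its non-ASCII bytes. An empty result would suggest the problem is not in the character encoding at all.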
We narrowed the problem down to character-set encoding: CSVs generated on Windows machines were failing. Linux's iconv command was able to fix the CSVs for us:
iconv -c --from-code=UTF-8 --to-code=ASCII path/to/uncleaned.csv > path/to/cleaned.csv
while its PHP equivalent didn't work (we tried both the //IGNORE and //TRANSLIT options):
$uncleaned_csv_text = file_get_contents($source_data_csv_filename);
$cleaned_csv_text = iconv('UTF-8', 'ASCII//IGNORE//TRANSLIT', $uncleaned_csv_text);
file_put_contents($source_data_csv_filename, $cleaned_csv_text);
..
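$source_data_csv_handle = fopen($source_data_csv_filename, 'r'); // fgetcsv expects a stream handle, not a filename
$headers = fgetcsv($source_data_csv_handle);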
While we could use PHP's exec function to run the shell command,
- it is less than ideal
- the practice is forbidden in our organisation on security grounds (Travis doesn't let it pass)
Is there any alternative way to achieve this CSV 'cleaning'?
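To make explicit what the command-line call does to the bytes: converting UTF-8 to ASCII with -c amounts to dropping every character that is not plain ASCII. Because every byte of a multi-byte UTF-8 sequence is 0x80 or higher, a rough pure-PHP approximation is to delete those bytes. This is a sketch, not a guaranteed equivalent of iconv; the names strip_non_ascii and parse_csv_text are invented for illustration, and it may well be the same idea as the regex-based cleaning mentioned in UPDATE-1.

```php
<?php
// Hypothetical helper: drop every byte >= 0x80, which removes all
// non-ASCII UTF-8 characters (and a UTF-8 BOM, if present).
function strip_non_ascii(string $text): string
{
    $cleaned = preg_replace('/[\x80-\xFF]+/', '', $text);
    if ($cleaned === null) {
        throw new RuntimeException('preg_replace failed, code ' . preg_last_error());
    }
    return $cleaned;
}

// Hypothetical helper: parse CSV text with fgetcsv via an in-memory
// stream, so nothing has to be written back to disk.
function parse_csv_text(string $text): array
{
    $handle = fopen('php://temp', 'r+');
    fwrite($handle, $text);
    rewind($handle);

    $rows = [];
    // Explicit delimiter, enclosure and escape arguments; length 0 means no limit.
    while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
        $rows[] = $row;
    }
    fclose($handle);

    return $rows;
}
```

Usage would be along the lines of $rows = parse_csv_text(strip_non_ascii(file_get_contents($source_data_csv_filename)));. Note that this discards accented characters instead of transliterating them, just as the -c flag does.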
UPDATE-1
We explored several other options, none of which worked for us:
- regex-based cleaning
- the forceutf8 package
- mb_convert_encoding (as suggested by discussions)
UPDATE-2
- Upon echoing the sha1 digest of the CSV's text before and after subjecting it to PHP's iconv function, we found that iconv is not making any change.
- Also, in my case, mb_check_encoding on the original CSV's text outputs true regardless of the encoding queried: windows-1252, ascii, utf-8.
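If the text really passes the ascii check, it should contain no bytes of 0x80 or above, and a UTF-8 to ASCII conversion would then have nothing to change, which would be consistent with the identical sha1 digests. In that case it may help to look at other byte-level properties, such as line endings or NUL bytes. A minimal sketch, with the hypothetical function name summarize_bytes:

```php
<?php
// Hypothetical helper: summarise byte-level properties of a string so the
// text can be compared before and after a conversion step.
function summarize_bytes(string $text): array
{
    $crlf = substr_count($text, "\r\n");

    return [
        'sha1'            => sha1($text),
        'length'          => strlen($text),
        'utf8_bom'        => strncmp($text, "\xEF\xBB\xBF", 3) === 0,
        'non_ascii_bytes' => preg_match_all('/[\x80-\xFF]/', $text),
        'nul_bytes'       => substr_count($text, "\0"),
        'crlf'            => $crlf,
        // CR or LF characters that are not part of a CRLF pair
        'lone_cr'         => substr_count($text, "\r") - $crlf,
        'lone_lf'         => substr_count($text, "\n") - $crlf,
    ];
}
```

Running it on the file that fails and on the copy cleaned by the command-line iconv, then comparing the two arrays, should show which property actually differs between them.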