I created a PHP script that imports a large CSV file into a database. While importing, I'd like to replace a special character such as ç with the letter c. Below is my code:

        $sql ="INSERT INTO bill_of_materials(allotment_code, category_name, activity, quantity, end_unit_quantity, unit, description,
        unit_cost, regular_labor_cost, end_unit_labor_cost, type, batch) VALUES";

        while (($line = fgets($handle)) !== false) {

          $sql .= "('".implode("', '", explode(";", sanitize($line)))."'),";
          $counter++;
        }

        // remove the trailing comma
        $sql = substr($sql, 0, strlen($sql) - 1);

        if (mysqli_query($new_conn, $sql) === TRUE) {

            echo 1;

            // database file name
            $new_database_file = $new_database.'.sql';

            // remove any previous backup
            if (file_exists('backup/'.$new_database_file)) {
                unlink('backup/'.$new_database_file);
            }

            // backup main database
            $command = "C:/xampp/mysql/bin/mysqldump --host=$host --user=$user --password=$pass $database_name > backup/$new_database_file";
            system($command);

        } else {
            echo $sql;
        }

In addition, my CSV contains a value such as W2-A1 2/F Front Fa�ade - B, and I'd like the output to be W2-A1 2/F Front Facade - B. How can I do this?

Nibiru Nibiru

1 Answer

First of all, make sure you are using the correct database client charset and collation. If they are correct, you can use iconv and preg_replace to sanitize unwanted characters, like so:

function sanitize($line){
    // Assumes $line is valid UTF-8; attempt to translate similar characters
    $clean = iconv('UTF-8', 'ASCII//TRANSLIT', $line);
    // Drop anything but printable ASCII; spaces, ';' delimiters and punctuation are kept
    $clean = preg_replace('/[^\x20-\x7E]/', '', $clean);
    return $clean;
}
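
The iconv call above assumes the input is already UTF-8. The � in the question suggests the CSV may instead be in a single-byte encoding such as Windows-1252, in which case the line has to be converted first. The following is only a sketch under that assumption: the helper names `to_utf8` and `sanitize_ascii` are hypothetical, the mbstring extension is assumed to be enabled, and an explicit character map stands in for `//TRANSLIT`, whose output can vary between platforms.

```php
<?php
// Hypothetical helper: treat any line that is not valid UTF-8 as Windows-1252.
// This is an assumption about the source file, not something the question confirms.
function to_utf8(string $line): string
{
    if (mb_check_encoding($line, 'UTF-8')) {
        return $line;
    }
    return mb_convert_encoding($line, 'UTF-8', 'Windows-1252');
}

// Hypothetical helper: normalize to UTF-8, map known characters, keep printable ASCII.
function sanitize_ascii(string $line): string
{
    $utf8 = to_utf8($line);

    // Explicit map for the character from the question (UTF-8 bytes of ç and Ç);
    // extend it with whatever other characters the data contains.
    $map = [
        "\xC3\xA7" => 'c',
        "\xC3\x87" => 'C',
    ];
    $ascii = strtr($utf8, $map);

    // Drop everything outside printable ASCII. This also removes the trailing
    // line break left by fgets(), but keeps spaces and ';' delimiters.
    return preg_replace('/[^\x20-\x7E]/', '', $ascii);
}

// Windows-1252 input: byte E7 is ç
echo sanitize_ascii("W2-A1 2/F Front Fa\xE7ade - B"), PHP_EOL;
// prints "W2-A1 2/F Front Facade - B"
```

If the data should keep its accented characters rather than lose them, the usual alternative is to stop after `to_utf8()` and make sure the connection itself uses a UTF-8 charset, for example with `mysqli_set_charset($new_conn, 'utf8mb4')`.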

If that doesn't help (for example, you have a truly corrupted byte stream, as can happen when saving a CSV from an old Excel source file), you can map the corrupted byte sequences to the intended characters. First you must find the corrupted byte sequence, e.g. by dumping the bytes with bin2hex($line) or ord($line[$position]). Example:

function sanitize($line){
    $map = [
        // corrupted chars sequence -> fixed chars
        "\xC3\xA8" => 'č',
        "\xC3\x88" => 'Č',
        "\xC3\xB9" => 'ů',
        "\xC3\x99" => 'Ů',
        "\xC3\xAC" => 'ě',
        "\xC3\x8C" => 'Ě',
        "\xC3\xB8" => 'ř',
        "\xC3\x98" => 'Ř',
        "\x53\xC2\x8D" => 'Š',
        "\xC2\xA9" => 'Š',
    ];
    return str_replace(array_keys($map), $map, $line);
}
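
To build a map like the one above, the broken byte sequences first have to be identified. One way to do that, sketched here with a hypothetical helper name (`find_non_ascii_sequences`), is to collect every run of non-ASCII bytes in a line and print it in the same `\xNN` notation the map uses:

```php
<?php
// Hypothetical helper: return each distinct run of non-ASCII bytes in $line,
// formatted as a "\xNN\xNN" string that can be pasted into the map as a key.
function find_non_ascii_sequences(string $line): array
{
    // Without the /u modifier the pattern works on raw bytes, not characters.
    preg_match_all('/[\x80-\xFF]+/', $line, $matches);

    $found = [];
    foreach (array_unique($matches[0]) as $bytes) {
        $hex = strtoupper(bin2hex($bytes));              // e.g. "C3A8"
        $found[] = '\\x' . implode('\\x', str_split($hex, 2));
    }
    return $found;
}

print_r(find_non_ascii_sequences("Fa\xC3\xA8ade"));
```

Running this over a sample of the imported lines gives the list of sequences that need an entry in `$map`; deciding which character each sequence was meant to be still has to be done by hand.
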
lubosdz
  • This works for me. Thanks. – Nibiru Nibiru Aug 03 '17 at 07:55
  • This isn't a good solution. Fix the core issue instead of patching over it. If you set the entire pipeline of code to the correct charset, that will fix your issue properly. The word "corrupted" is not correct in this context either; it's just a different encoding. – Qirel Aug 03 '17 at 08:01
  • @Qirel Please note that the very first sentence recommends fixing the database client collation/charset. The suggested PHP functions are a fallback solution for when the user cannot fix that. – lubosdz Aug 03 '17 at 08:08
  • Fixing the charset issue will prevent further encoding issues, this approach will still have broken encoding for any new data. It might be a workaround for data that's already of the wrong charset - but shouldn't substitute a proper fix for new data entries. – Qirel Aug 03 '17 at 09:21
  • @Qirel The question says nothing about whether the source CSV files are corrupted. That is your own assumption. They might be truly corrupted if, for example, they were saved from an old Excel 2003 file with windows-1250 encoding into UTF-8. I had this case recently: my task was to import hundreds of old CSV files from third parties, many of which had corrupted encoding, with no way to obtain fixed originals. Binary character translation was the only way to fix it. Therefore, setting the proper database client connection charset may or may not fix the issue. – lubosdz Aug 03 '17 at 19:20