I want to delete the BOM from my imported file, but it just doesn't seem to work.

I tried `preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $file);` and a `str_replace`.

I hope somebody can see what I'm doing wrong.

$filepath = get_bloginfo('template_directory')."/testing.csv";
setlocale(LC_ALL, 'nl_NL');
ini_set('auto_detect_line_endings',TRUE);
$file = fopen($filepath, "r") or die("Error opening file");
$i = 0;
while(($line = fgetcsv($file, 1000, ";")) !== FALSE) {
    if($i == 0) {
        $c = 0;
        foreach($line as $col) {
            $cols[$c] = utf8_encode($col);
            $c++;
        }
    } else if($i > 0) {
        $c = 0;
        foreach($line as $col) {
            $data[$i][$cols[$c]] = utf8_encode($col);
            $c++;
        }
    }
    $i++;
}

-----------
SOLVED VERSION:

setlocale(LC_ALL, 'nl_NL');
ini_set('auto_detect_line_endings',TRUE);
require_once(ABSPATH.'wp-admin/includes/file.php' );

$path = get_home_path();        
$filepath = $path .'wp-content/themes/pon/testing.csv';
$content = file_get_contents($filepath); 
file_put_contents($filepath, str_replace("\xEF\xBB\xBF",'', $content));

// file_put_contents() automatically closes the file, so reopen it for reading
$file = fopen($filepath, "r") or die("Error opening file"); 

$i = 0;
while(($line = fgetcsv($file, 1000, ";")) !== FALSE) {
    if($i == 0) {
        $c = 0;
        foreach($line as $col) {
            $cols[$c] = $col;
            $c++;
        }
    } else if($i > 0) {
        $c = 0;
        foreach($line as $col) {
            $data[$i][$cols[$c]] = $col;
            $c++;
        }
    }
    $i++;
}

I found that this removes the BOM and overwrites the file with the new data. The problem is that the rest of my script doesn't work anymore and I can't see why. It is a new .csv file.

Owen Pauling
Interactive
  • `$cols[$c]` inside your first foreach is pointless. `$cols` is a COPY of whatever line/field you're processing. you need `foreach($lines as $key => $col) { $lines[$key] = utf8_encode($col); }` – Marc B Aug 24 '15 at 14:39
  • The PHP docs comments for fgetcsv have a nice answer, https://www.php.net/manual/en/function.fgetcsv.php#122696 - open the file, read the first 3 bytes (which moves the file pointer) and check whether they equal the BOM string; if not, rewind, and then proceed with fgetcsv – jave.web Jun 19 '19 at 21:56
  • [The same issue has been solved here](https://stackoverflow.com/questions/5396560/how-do-i-convert-special-utf-8-chars-to-their-iso-8859-1-equivalent-using-javasc): `fixedstring = decodeURIComponent(escape(utfstring));` – olivia Oct 03 '19 at 19:55

7 Answers

Try this:

function removeBomUtf8($s) {
    // Strip the three-byte UTF-8 BOM (EF BB BF) if the string starts with it.
    if (substr($s, 0, 3) === "\xEF\xBB\xBF") {
        return substr($s, 3);
    }
    return $s;
}
Tomasz
  • It gives me this: `Warning: substr() expects parameter 1 to be string, resource given` – Interactive Aug 24 '15 at 15:25
  • What are you passing into this function? It should be like that: `$file = 'something.csv';` `$content = file_get_contents($file);` `var_dump(removeBomUtf8($content));` And then start processing this file. – Tomasz Aug 24 '15 at 15:38
  • In this line: `$content = file_get_contents($file);` change `$file` to `$filepath` – Tomasz Aug 24 '15 at 17:28
  • Okay, this is some progress. Thanks. I now get a string with all my CSV data without the BOM. Awesome. If I remove the `var_dump` and let my script continue with `while(($line = fgetcsv(removeBomUtf8($content), 1000, ";")) !== FALSE) {`, it gives me a blank page with no error or progress. Any ideas? – Interactive Aug 25 '15 at 07:41
  • I slightly changed your idea (I don't know if it is the best, but it works). I found that `file_put_contents` closes the file, so I just had to reopen it. Thanks for your help – Interactive Aug 25 '15 at 11:31
  • To remove the UTF-16 little-endian BOM: `(substr($s, 0, 2) == chr(0xFF).chr(0xFE))` – Nolwennig Jun 20 '18 at 09:16
  • This is the correct answer. You have to pass a single string parameter that contains the BOM, and it will work – Nitin Vaghani Dec 04 '19 at 16:53
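
The blank page described in the comments above is consistent with `fgetcsv()` being handed a string: it expects a stream resource. One possible way around that, shown only as a sketch, is to copy the BOM-stripped string into an in-memory `php://temp` stream and read from there. The helper name `parseCsvString` is hypothetical and does not come from the thread.

```php
<?php
// Hypothetical helper (not from the thread): parse a CSV string that may
// start with a UTF-8 BOM and return its rows as arrays of fields.
function parseCsvString(string $content, string $delimiter = ';'): array
{
    // Drop the UTF-8 BOM if the string starts with one.
    if (strncmp($content, "\xEF\xBB\xBF", 3) === 0) {
        $content = substr($content, 3);
    }

    // fgetcsv() needs a stream resource, so copy the string into a temporary stream.
    $stream = fopen('php://temp', 'r+');
    fwrite($stream, $content);
    rewind($stream);

    $rows = [];
    while (($line = fgetcsv($stream, 0, $delimiter, '"', '\\')) !== false) {
        if ($line === [null]) {
            continue; // fgetcsv() returns [null] for a blank line
        }
        $rows[] = $line;
    }
    fclose($stream);

    return $rows;
}
```

With this, `parseCsvString(file_get_contents($filepath))` would take the place of the `fopen()`/`fgetcsv()` loop, and the original file is left untouched on disk.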

Isn't the BOM there to give you a clue on how to re-encode the input to something your script/app/database needs? Just deleting it isn't going to help.

This is how I force a string (read from a file with `file_get_contents()`) to be encoded in UTF-8 and get rid of the BOM as well:

switch (true) { 
    case (substr($string,0,3) == "\xef\xbb\xbf") :
        $string = substr($string, 3);
        break;
    case (substr($string,0,2) == "\xfe\xff") :                            
        $string = mb_convert_encoding(substr($string, 2), "UTF-8", "UTF-16BE");
        break;
    case (substr($string,0,2) == "\xff\xfe") :                            
        $string = mb_convert_encoding(substr($string, 2), "UTF-8", "UTF-16LE");
        break;
    case (substr($string,0,4) == "\x00\x00\xfe\xff") :
        $string = mb_convert_encoding(substr($string, 4), "UTF-8", "UTF-32BE");
        break;
    case (substr($string,0,4) == "\xff\xfe\x00\x00") :
        $string = mb_convert_encoding(substr($string, 4), "UTF-8", "UTF-32LE");
        break;
    default:
        $string = iconv(mb_detect_encoding($string, mb_detect_order(), true), "UTF-8", $string);
};
Lisa
  • I like this, except that UTF-32LE will never be detected because UTF-16LE will trigger it first. Longest comparisons should be at the top. – bwaindwain Nov 30 '22 at 22:49
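
The ordering point in the comment above can be sketched as a lookup table that tests the four-byte BOMs before the shorter ones. This is an illustrative variant, not the answer author's code: the function name `toUtf8ByBom` is hypothetical, the `mbstring` extension is assumed to be loaded, and the answer's fallback detection for BOM-less input is deliberately left out.

```php
<?php
// Hypothetical helper: convert a string to UTF-8 based on its BOM,
// checking longer BOMs first.
function toUtf8ByBom(string $s): string
{
    // Order matters: the UTF-32LE BOM (FF FE 00 00) begins with the
    // UTF-16LE BOM (FF FE), so the 4-byte entries must come first.
    $boms = [
        "\x00\x00\xFE\xFF" => 'UTF-32BE',
        "\xFF\xFE\x00\x00" => 'UTF-32LE',
        "\xEF\xBB\xBF"     => 'UTF-8',
        "\xFE\xFF"         => 'UTF-16BE',
        "\xFF\xFE"         => 'UTF-16LE',
    ];

    foreach ($boms as $bom => $encoding) {
        if (strncmp($s, $bom, strlen($bom)) === 0) {
            $body = substr($s, strlen($bom));
            // UTF-8 input only needs the BOM removed; the others are re-encoded.
            return $encoding === 'UTF-8'
                ? $body
                : mb_convert_encoding($body, 'UTF-8', $encoding);
        }
    }

    return $s; // no BOM found: returned unchanged in this sketch
}
```

One trade-off of this ordering is that a UTF-16LE file whose first character is U+0000 would be read as UTF-32LE; that is rare in practice, but it is a known ambiguity of BOM sniffing.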

The correct way is to skip the BOM if it is present in the file (https://www.php.net/manual/en/function.fgetcsv.php#122696):

ini_set('auto_detect_line_endings',TRUE);
$file = fopen($filepath, "r") or die("Error opening file");
if (fgets($file, 4) !== "\xef\xbb\xbf") { // Skip the BOM if present
    rewind($file); // Otherwise rewind the pointer to the start of the file
}

$i = 0;
while(($line = fgetcsv($file, 1000, ";")) !== FALSE) {
    ...
}
AndreyP
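
The skip-or-rewind idea from this answer could be wrapped in a small helper so the calling code stays unchanged. The following is a sketch only: `openSkippingBom` is a hypothetical name, and it uses `fread()` instead of `fgets()` so that an early newline in the file cannot shorten the read.

```php
<?php
// Hypothetical helper: open a file for reading and leave the pointer just
// after a UTF-8 BOM if there is one, or at the start of the file otherwise.
// Returns the stream resource, or false if the file cannot be opened.
function openSkippingBom(string $path)
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        return false;
    }

    // Read the first three bytes; if they are not the BOM, go back to the start.
    if (fread($handle, 3) !== "\xEF\xBB\xBF") {
        rewind($handle);
    }

    return $handle;
}
```

The returned handle can then be passed straight to the existing `fgetcsv($file, 1000, ";")` loop, and the file itself is never rewritten.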

If the character encoding functions don't work for you (as is the case for me in some situations) and you know for a fact that your file always has a BOM, you can simply use an fseek() to skip the first 3 bytes, which is the length of the UTF-8 BOM.

$fp = fopen("testing.csv", "r");
fseek($fp, 3);

You should also not use explode() to split your CSV lines and columns, because if a column contains the character you split by, you will get an incorrect result. Use this instead:

// fgetcsv() returns false at end of file, so test its result directly
while (($arrayLine = fgetcsv($fp, 0, ";", '"')) !== false) {
    ...
}
voidmind
  • If you cannot be sure whether there is a BOM, it is better to check for it and rewind if there is none: `if (fread($handle, 3) !== chr(0xEF).chr(0xBB).chr(0xBF)) { rewind($handle); }` instead of the `fseek` – DanielW Sep 19 '19 at 14:50
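
To make the warning about explode() concrete, here is a small comparison on a made-up line whose first field contains the delimiter inside quotes. The sample data is invented for illustration.

```php
<?php
// A line whose first field contains the delimiter inside quotes.
$line = '"Jansen; Piet";Utrecht';

// explode() knows nothing about quoting, so it also splits inside the field.
$naive = explode(';', $line);

// str_getcsv() honours the enclosure character and keeps the field whole.
$parsed = str_getcsv($line, ';', '"', '\\');

var_dump(count($naive));  // int(3)
var_dump(count($parsed)); // int(2)
var_dump($parsed[0]);     // string(12) "Jansen; Piet"
```

The same applies to `fgetcsv()`, which uses the same quoting rules when reading from a file handle.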

Read the data with `file_get_contents`, then use `mb_convert_encoding` to convert it to UTF-8.

UPDATE

$filepath = get_bloginfo('template_directory')."/testing.csv";
$fileContent = file_get_contents($filepath);
$fileContent = mb_convert_encoding($fileContent, "UTF-8");
$lines = explode("\n", $fileContent);
foreach($lines as $line) {
    $cols = explode(";", $line);
    // etc...
}
MrRP
  • @Interactive `file_get_contents` reads whole file. `explode` it by "\n" or "\r\n". It gives back an array. Then walk through on this array. – MrRP Aug 24 '15 at 15:19
  • If I run this, it gives me an array where the title fields are in the first array and every following array contains the information for one person. This is great, but I have no idea how to use this for what I'm doing. So I guess I'll be pulling an all-nighter. – Interactive Aug 24 '15 at 16:03
  • I slightly changed your idea ( don't know if it is the best but it works.) I found that `file_put_contents` closes the file so I just had to reopen it. Thanks for your help – Interactive Aug 25 '15 at 11:31

Check this solution, this solved my case: https://www.php.net/manual/en/function.str-getcsv.php#116763

$bom = pack('CCC', 0xEF, 0xBB, 0xBF);
// Default to the original string so $body is defined when there is no BOM.
$body = $yourString;
if (strncmp($yourString, $bom, 3) === 0) {
    $body = substr($yourString, 3);
}
József Takó

Using @Tomasz's answer as the main inspiration for this, and @Nolwennig's comment:

// Strip byte order marks from a string
function strip_bom($string, $type = 'utf8') {
    $length = 0;

    switch ($type) {
        case 'utf8':
            $length = substr($string, 0, 3) === chr(0xEF) . chr(0xBB) . chr(0xBF) ? 3 : 0;
            break;

        case 'utf16_little_endian':
            $length = substr($string, 0, 2) === chr(0xFF) . chr(0xFE) ? 2 : 0;
            break;
    }

    return $length ? substr($string, $length) : $string;
}
Danny Beckett