How to remove multiple UTF-8 BOM sequences

Question

I'm using PHP5 (CGI) to output template files from the filesystem, and I'm having issues spitting out raw HTML.

private function fetch($name) {
    $path = $this->j->config['template_path'] . $name . '.html';
    if (!file_exists($path)) {
        dbgerror('Could not find the template "' . $name . '" in ' . $path);
    }
    $f = fopen($path, 'r');
    $t = fread($f, filesize($path));
    fclose($f);
    if (substr($t, 0, 3) == b'\xef\xbb\xbf') {
        $t = substr($t, 3);
    }
    return $t;
}

Even though I've added the BOM fix I'm still having problems with Firefox accepting it. You can see a live copy here: http://ircb.in/jisti/ (and the template file I threw at http://ircb.in/jisti/home.html if you want to check it out)

Any idea how to fix this? o_o

UTF-8 files shouldn't have a BOM. If your editor puts one in, there should be a setting to omit it; if your editor won't let you leave the BOM out, replace your editor. – Lie Ryan, Apr 24 '12 at 02:11

score 176 · Accepted Answer · answered Mar 15 '13 at 02:55

You can use the following code to remove a UTF-8 BOM:

//Remove UTF8 Bom

function remove_utf8_bom($text)
{
    $bom = pack('H*','EFBBBF');
    $text = preg_replace("/^$bom/", '', $text);
    return $text;
}

jasonhao

1

For some reason in the Google+ API, this BOM shows up at the end of the content variable, so I needed to tweak this to remove it from the end of the string. – Artem Russakovskii Mar 02 '17 at 18:08
1

Can someone explain how the pack function is used here? I know it converts a string to a binary representation, but I'm struggling to understand how this helps with identifying the BOM Unicode character. – Priyath Gregory Oct 03 '18 at 06:19
1

This worked great for my requirement to read the CSV output from SSRS and append to a larger file. – Trevor Dec 20 '18 at 19:28
I used this with `trim` to cleanse copy/pasted form data like this: `$bom = pack('H*','EFBBBF'); $replacementChars = " \n\r\t\v\0" . $bom; $cleanVar = trim($dirtyVar, $replacementChars);`. – Christopher Schultz Apr 13 '21 at 14:16
3

@fsociety The BOM is three bytes: `0xef 0xbb 0xbf`. So pack is using a format of H*, which means interpret all values in the string as hexadecimal bytes. I prefer o1max's answer (although it has a lower score), which simply uses a string with escape characters: `"\xEF\xBB\xBF"` – Dan Apr 20 '21 at 19:50
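
To make the comments above concrete, here is a minimal sketch showing that the pack() forms and the escaped string produce the same three bytes, plus a suffix check for the case where the BOM turns up at the end of the content. The helper name remove_trailing_utf8_bom is hypothetical and not from the thread.

```php
<?php
// Three spellings of the same three bytes; all compare identical.
$fromPack  = pack('H*', 'EFBBBF');           // hex string -> raw bytes
$fromChars = pack('CCC', 0xEF, 0xBB, 0xBF);  // three unsigned chars
$fromEsc   = "\xEF\xBB\xBF";                 // double-quoted escape sequences

var_dump($fromPack === $fromEsc, $fromChars === $fromEsc); // bool(true) twice
echo bin2hex($fromPack), "\n";                             // efbbbf

// Hypothetical helper: drop one BOM from the end of a string.
function remove_trailing_utf8_bom($text)
{
    if (substr($text, -3) === "\xEF\xBB\xBF") {
        // The cast covers older PHP versions where substr() can return false.
        return (string) substr($text, 0, -3);
    }
    return $text;
}
```

One thing to keep in mind with the trim() variant: trim() treats its second argument as a set of single bytes, not as a sequence. Passing the BOM there also strips a lone 0xEF, 0xBB or 0xBF at either end, so a string ending in "»" (bytes 0xC2 0xBB) would lose its last byte. Comparing the whole three-byte suffix avoids that.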

score 57 · Answer 2 · answered Sep 18 '13 at 11:19

try:

// -------- read the file-content ----
$str = file_get_contents($source_file); 

// -------- remove the utf-8 BOM ----
$str = str_replace("\xEF\xBB\xBF",'',$str); 

// -------- get the Object from JSON ---- 
$obj = json_decode($str);

:)

o1max

score 18 · Answer 3 · answered Jun 19 '14 at 17:03

Another way to remove the BOM, which is Unicode code point U+FEFF:

$str = preg_replace('/\x{FEFF}/u', '', $file);
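
Two details of the pattern above are worth knowing. It is not anchored, so it removes U+FEFF anywhere in the string, and with the /u modifier preg_replace() returns null when the subject is not valid UTF-8. A minimal sketch of an anchored variant that also handles the null case, with strip_leading_feff as a hypothetical name:

```php
<?php
// Hypothetical helper: strip leading U+FEFF code points only, and fall back
// to the unchanged input when PCRE rejects the subject as invalid UTF-8.
function strip_leading_feff($text)
{
    $result = preg_replace('/^\x{FEFF}+/u', '', $text);
    return $result === null ? $text : $result;
}
```

Returning the input unchanged on failure is just one possible choice; depending on the application it may be better to treat invalid UTF-8 as an error.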

Dean Or

score 8 · Answer 4 · answered Apr 24 '12 at 02:07

b'\xef\xbb\xbf' is the literal 12-character string \xef\xbb\xbf, because \x escape sequences are not interpreted inside single quotes. If you want to check for a BOM, you need to use double quotes, so the \x sequences are actually interpreted into bytes:

"\xef\xbb\xbf"

Your files also seem to contain a lot more garbage than just a single leading BOM:

$ curl http://ircb.in/jisti/ | xxd

0000000: efbb bfef bbbf efbb bfef bbbf efbb bfef  ................
0000010: bbbf efbb bf3c 2144 4f43 5459 5045 2068  .....<!DOCTYPE h
0000020: 746d 6c3e 0a3c 6874 6d6c 3e0a 3c68 6561  tml>.<html>.<hea
...
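
The dump above starts with seven BOMs back to back, so removing a single leading BOM is not enough. A minimal sketch that strips all of them in one pass, assuming the output can be post-processed as a string; strip_leading_utf8_boms is a hypothetical name:

```php
<?php
// Hypothetical helper: remove every BOM at the start of $text, however many
// were stacked up, and leave the rest of the string untouched.
function strip_leading_utf8_boms($text)
{
    return preg_replace("/^(?:\xEF\xBB\xBF)+/", '', $text);
}

// Simulate the start of the dumped response: seven BOMs, then the doctype.
$html = str_repeat("\xEF\xBB\xBF", 7) . "<!DOCTYPE html>\n";
echo strip_leading_utf8_boms($html); // <!DOCTYPE html>
```

This only treats the symptom. It does not explain where the extra BOMs come from.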

deceze

if I was using n++, why would it cause this? it's saving it as unix/utf8 -bom – sheppardzw Apr 28 '12 at 02:17
Save it as UTF-8 NO BOM (or whatever it's called in N++). – deceze Apr 28 '12 at 02:26
I did and I'm still getting the same result. I curl'd the direct file (curl http://ircb.in/jisti/home.html | xxd) and got no leading characters, but curl'ing the PHP script adds the excess data in the front and all I'm using is print to output the data. – sheppardzw Apr 28 '12 at 02:34
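
The last comment says the template is clean while the script's output is not. One possible explanation, which the thread does not confirm, is that the PHP source files themselves were saved with a BOM. Any bytes before the opening <?php tag are sent as output, so each such file would contribute one BOM. A minimal sketch to check for that; find_php_files_with_bom and its directory argument are hypothetical names:

```php
<?php
// Hypothetical helper: list .php files under $dir whose first three bytes
// are a UTF-8 BOM.
function find_php_files_with_bom($dir)
{
    $hits = array();
    $files = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($dir, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($files as $file) {
        if (!$file->isFile() || strtolower($file->getExtension()) !== 'php') {
            continue;
        }
        // Read only the first three bytes of each file.
        $head = file_get_contents($file->getPathname(), false, null, 0, 3);
        if ($head === "\xEF\xBB\xBF") {
            $hits[] = $file->getPathname();
        }
    }
    return $hits;
}
```

Calling it on the application's root directory returns the paths that would need re-saving without a BOM.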

score 6 · Answer 5 · edited Nov 29 '19 at 05:56

If anybody is importing CSV files, the code below may be useful:

$header = fgetcsv($handle);
$bom = pack('H*','EFBBBF'); // the three BOM bytes, built once outside the loop
foreach ($header as $key => $val) {
    // strip a BOM from the start of the value, if present
    $header[$key] = preg_replace("/^$bom/", '', $val);
}
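
An alternative sketch for the CSV case is to consume the BOM from the stream before parsing, so that fgetcsv() never sees it. This assumes a seekable stream, and skip_utf8_bom is a hypothetical name:

```php
<?php
// Hypothetical helper: advance a seekable stream past one leading BOM,
// or leave the position unchanged when there is none.
function skip_utf8_bom($handle)
{
    $start = ftell($handle);
    if (fread($handle, 3) !== "\xEF\xBB\xBF") {
        fseek($handle, $start);
    }
}

// Demo on an in-memory stream standing in for a CSV file on disk.
$handle = fopen('php://memory', 'w+');
fwrite($handle, "\xEF\xBB\xBFid,name\n1,Ada\n");
rewind($handle);

skip_utf8_bom($handle);
$header = fgetcsv($handle, 0, ',', '"', '\\');
var_dump($header[0] === 'id'); // bool(true)
fclose($handle);
```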

Regolith

answered Jul 18 '18 at 06:10

phvish

score 5 · Answer 6 · answered Jun 22 '16 at 15:13

This global function normalizes strings for systems whose base charset is UTF-8. Thanks!

function prepareCharset($str) {

    // set default encode
    mb_internal_encoding('UTF-8');

    // pre filter
    if (empty($str)) {
        return $str;
    }

    // get charset
    $charset = mb_detect_encoding($str, array('ISO-8859-1', 'UTF-8', 'ASCII'));

    if (stristr($charset, 'utf') || stristr($charset, 'iso')) {
        $str = iconv('ISO-8859-1', 'UTF-8//TRANSLIT', utf8_decode($str));
    } else {
        $str = mb_convert_encoding($str, 'UTF-8', 'UTF-8');
    }

    // remove BOM
    $str = urldecode(str_replace("%C2%81", '', urlencode($str)));

    // prepare string
    return $str;
}

score 4 · Answer 7 · answered Nov 07 '16 at 04:53

An extra method to do the same job:

function remove_utf8_bom_head($text) {
    if(substr(bin2hex($text), 0, 6) === 'efbbbf') {
        $text = substr($text, 3);
    }
    return $text;
}

The other methods I found did not work in my case.

Hope it helps in some special case.

score 3 · Answer 8 · answered Feb 18 '19 at 09:06

A solution without the pack function:

$a = "\xEF\xBB\xBF1";
var_dump($a); // string(4) "1"

function deleteBom($text)
{
    return preg_replace("/^\xEF\xBB\xBF/", '', $text);
}

var_dump(deleteBom($a)); // string(1) "1"

trank

If they can show up more than once, you might want to use "/^(\xEF\xBB\xBF)+/" – Scott Jul 31 '20 at 21:40

score 2 · Answer 9 · edited Sep 23 '21 at 10:30

I'm not so fond of using preg_replace or preg_match for simple tasks. What about this alternative method of detecting and removing the BOM?

function remove_utf8_bom(string $text): string
{
    $bomStart = mb_substr($text, 0, 1);
    return ($bomStart == pack('H*','EFBBBF')) ?
        mb_substr($text, 1) :
        $text;
}

answered Jul 05 '21 at 08:59

Kapitein Witbaard

score 1 · Answer 10 · answered Jul 12 '17 at 17:14

If you are reading from an API using file_get_contents and get an inexplicable NULL from json_decode, check the value of json_last_error(). Sometimes the value returned by file_get_contents has an extraneous BOM that is almost invisible when you inspect the string, but it makes json_last_error() return JSON_ERROR_SYNTAX (4).

>>> $json = file_get_contents("http://api-guiaserv.seade.gov.br/v1/orgao/all");
=> "\t{"orgao":[{"Nome":"Tribunal de Justi\u00e7a","ID_Orgao":"59","Condicao":"1"}, ...]}"
>>> json_decode($json);
=> null
>>>

In this case, check the first 3 bytes - echoing them is not very useful because the BOM is invisible on most settings:

>>> substr($json, 0, 3)
=> "  "
>>> substr($json, 0, 3) == pack('H*','EFBBBF');
=> true
>>>

If the line above returns TRUE for you, then a simple test may fix the problem:

>>> json_decode($json[0] == "{" ? $json : substr($json, 3))
=> {#204
     +"orgao": [
       {#203
         +"Nome": "Tribunal de Justiça",
         +"ID_Orgao": "59",
         +"Condicao": "1",
       },
     ],
     ...
   }
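
The test on $json[0] above only covers payloads that start with "{". A minimal sketch that checks for the BOM itself and reports decode errors, where json_decode_without_bom is a hypothetical name and the sample payload is a shortened version of the one above:

```php
<?php
// Hypothetical helper: decode JSON that may start with a UTF-8 BOM.
// Returns the decoded value, or throws with the decoder's error message.
function json_decode_without_bom($json)
{
    if (strncmp($json, "\xEF\xBB\xBF", 3) === 0) {
        $json = substr($json, 3);
    }
    $data = json_decode($json, true);
    if ($data === null && json_last_error() !== JSON_ERROR_NONE) {
        throw new RuntimeException('JSON decode failed: ' . json_last_error_msg());
    }
    return $data;
}

$data = json_decode_without_bom("\xEF\xBB\xBF" . '{"orgao":[{"ID_Orgao":"59"}]}');
echo $data['orgao'][0]['ID_Orgao'], "\n"; // 59
```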

score 1 · Answer 11 · answered Nov 12 '22 at 06:49

How about this:

  function removeUTF8BomHeader($data) {
    if (substr($data, 0, 3) == pack('CCC', 0xef, 0xbb, 0xbf)) {
      $data = substr($data, 3);
    }

    return $data;
  }

I tested this a lot and it works perfectly without any issues.

Panagiotis Koursaris

score 0 · Answer 12 · edited Nov 29 '19 at 05:51

When working with faulty software, the BOM can get duplicated with every save.

So I am using this to get rid of it.

function remove_utf8_bom($text) {
    $bom = pack('H*','EFBBBF');
    while (preg_match("/^$bom/", $text)) {
        $text = preg_replace("/^$bom/", '', $text);
    }
    return $text;
}

Regolith

answered Jun 09 '19 at 08:49

Juergen Schulze
