I ran into the BOM Unicode character when parsing a CSV file and found this neat solution that solved the problem.

//Remove UTF8 Bom
function remove_utf8_bom($text) {
    $bom = pack('H*','EFBBBF');
    $text = preg_replace("/^$bom/", '', $text);
    return $text;
}

Link: How to remove multiple UTF-8 BOM sequences before "<!DOCTYPE>"?

However, I don't completely understand how this works and was wondering if someone could explain what's happening here.

Some questions that I have:

  1. Is 'EFBBBF' a HEX representation of the BOM Unicode character?
  2. What is H*? (I assume this is how we specify the format of the 'EFBBBF' string)
  3. Is it necessary to convert the 'EFBBBF' to a binary representation?
  4. When I try to print the $bom variable, it's just an empty string. Why is the BOM invisible?
  5. How does the preg_replace work with binary characters?
Priyath Gregory

1 Answer

The BOM (byte order mark) is the Unicode character U+FEFF.

EFBBBF is the hex representation of the UTF-8 encoding of this character, i.e. the three bytes 0xEF 0xBB 0xBF. pack('H*', ...) takes a string of hex digits and converts it into bytes, treating each pair of digits as one byte value (high nibble first). The * means "consume all remaining digits".
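
As a quick check of the above, this small sketch compares the output of pack with the same bytes written as \x escapes. The variable names are only illustrative.

```php
<?php
// Build the UTF-8 BOM from its hex digits: each pair of digits becomes one byte.
$bom = pack('H*', 'EFBBBF');

// The same three bytes written with \x escapes in a double-quoted string.
$literal = "\xEF\xBB\xBF";

var_dump($bom === $literal); // bool(true): identical byte strings
var_dump(strlen($bom));      // int(3): strlen counts bytes, not characters
var_dump(bin2hex($bom));     // string(6) "efbbbf": back to (lowercase) hex digits
```

Since the two strings are identical, writing "\xEF\xBB\xBF" directly would avoid the call to pack altogether.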

Writing the BOM as the string 'EFBBBF' makes it easier to type, but it does mean you have to convert it to bytes with pack before you can compare it to the BOM at the start of your data.

The BOM is invisible when you print it because U+FEFF is the Unicode character ZERO WIDTH NO-BREAK SPACE, which has no visible glyph. It is only treated as a BOM when it is the first character in the file.
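
To see that the bytes are really there even though nothing shows up on screen, you can inspect the string at the byte level. This is a minimal sketch; the CSV header line in $line is a made-up example, not data from the question.

```php
<?php
// Hypothetical first line of a CSV file that starts with a UTF-8 BOM.
$line = "\xEF\xBB\xBFid,name";

echo $line, PHP_EOL;                        // most terminals show just: id,name
echo strlen($line), PHP_EOL;                // 10: three BOM bytes plus 7 visible characters
echo bin2hex(substr($line, 0, 3)), PHP_EOL; // efbbbf

// Byte-wise comparison of the first three bytes against the BOM.
var_dump(strncmp($line, "\xEF\xBB\xBF", 3) === 0); // bool(true)
```

The same applies to printing $bom on its own: the output is not an empty string, it is three bytes that render as a zero-width character.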

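For this to work correctly, $text has to be the raw UTF-8 byte stream, which is what a PHP string holds when you read the file as-is. Alternatively, you can skip pack and have the regex engine interpret both pattern and data as UTF-8 by adding the /u modifier, then match the code point directly:

$text = preg_replace('/^\x{FEFF}/u', '', $text);

Note that with /u, preg_replace returns null if $text is not valid UTF-8.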
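
Putting the pieces together, here is a hedged sketch of two ways to strip a leading BOM in PHP: one compares raw bytes, the other uses PCRE's /u modifier with \x{FEFF}, which is the code point syntax PCRE accepts. The function names strip_bom_bytes and strip_bom_codepoint and the sample CSV are made up for illustration.

```php
<?php
// Byte-level approach: compare the first three bytes directly, no regex needed.
function strip_bom_bytes(string $text): string
{
    $bom = "\xEF\xBB\xBF";
    return strncmp($text, $bom, 3) === 0 ? substr($text, 3) : $text;
}

// Code-point approach: /u makes PCRE read pattern and subject as UTF-8.
function strip_bom_codepoint(string $text): string
{
    $result = preg_replace('/^\x{FEFF}/u', '', $text);
    // preg_replace returns null on error, e.g. when $text is not valid UTF-8;
    // in that case this sketch simply returns the input unchanged.
    return $result ?? $text;
}

// Hypothetical CSV content with a leading BOM.
$csv = "\xEF\xBB\xBFid,name\n1,Alice\n";

var_dump(strip_bom_bytes($csv) === "id,name\n1,Alice\n");     // bool(true)
var_dump(strip_bom_codepoint($csv) === "id,name\n1,Alice\n"); // bool(true)
var_dump(strip_bom_bytes("no bom") === "no bom");             // bool(true)
```

Both variants remove at most one BOM and leave input without a BOM untouched. The byte-level version also works on data that is not valid UTF-8, which may matter for CSV files of unknown origin.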
JGNI