29

I was writing some commented PHP classes and I stumbled upon a problem. My name (for the @author tag) ends with a ș (a non-ASCII character, and a strange name, I know).

Even though I save the file as UTF-8, some friends reported that they see that character totally messed up (È™). The problem goes away when I add the BOM signature. But that troubles me a bit, since I don't know much about the BOM beyond what I saw on Wikipedia and in some similar questions here on SO.
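The garbled È™ is what the two UTF-8 bytes of ș (C8 99) look like when they are decoded as Windows-1252. Here is a small sketch that reproduces the effect. It assumes the mbstring extension is available and that the friends' editors fell back to Windows-1252; that is only a guess, and another legacy code page would give different garbage.

```php
<?php
// "ș" (U+0219) is stored in UTF-8 as the two bytes C8 99.
$utf8Bytes = "\xC8\x99";
echo bin2hex($utf8Bytes), "\n"; // c899

// Misread those bytes as Windows-1252 and re-encode the result as UTF-8,
// which is effectively what an editor does when it guesses the wrong encoding.
if (function_exists('mb_convert_encoding')) {
    $misread = mb_convert_encoding($utf8Bytes, 'UTF-8', 'Windows-1252');
    echo $misread, "\n"; // È™
}
```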

I know that it adds some things at the beginning of the file, and from what I understood it's not that bad, but I'm concerned because the only problematic scenarios I read about involved PHP files. And since I'm writing PHP classes to share them, being 100% compatible is more important than having my name in the comments.
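For reference, the "things at the beginning of the file" are the three bytes EF BB BF. Below is a minimal sketch of checking a string of file contents for them; `has_utf8_bom` is a hypothetical helper name, not a built-in function.

```php
<?php
// Hypothetical helper: true if $bytes begins with the UTF-8 BOM (EF BB BF).
function has_utf8_bom(string $bytes): bool {
    return strncmp($bytes, "\xEF\xBB\xBF", 3) === 0;
}

$withBom    = "\xEF\xBB\xBF<?php // @author ...";
$withoutBom = "<?php // @author ...";

var_dump(has_utf8_bom($withBom));    // bool(true)
var_dump(has_utf8_bom($withoutBom)); // bool(false)
```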

But I'm trying to understand the implications. Should I use it without worrying, or are there cases where it might cause damage? If so, when?

svick
treznik
  • Note that today I was having a problem where a ` – Volomike May 14 '12 at 15:12
  • Note also that session vars don't seem to work properly across pages when a page is having this UTF-8 BOM problem. I had to use a hex editor like ghex on Ubuntu plus `iconv -f utf8 -t ascii old.php > new.php` repeatedly to detect all the Unicode problems, remove them, and save the page finally in ASCII with no errors from the iconv command. Once that was done, I noticed session vars held state between pages. – Volomike May 14 '12 at 15:50
  • It appears that when UTF-8 BOM is detected in a file, headers are never sent that hold session, and therefore session variables between pages will get brand new sessions instead of holding the same session. – Volomike May 14 '12 at 16:00
  • It's not that PHP "detects" the BOM and, if present, "decides" to discard the session variables. The problem is that PHP (at least I have seen versions that do this) reads the file, reads an ï, prints it, reads a », prints it, reads a ¿, prints it... The problem now is that session_start() causes some header communication, which can only happen while we are still in the header phase, and printing something ends this phase. If you had set the "display_errors" ini variable to "On", you would get a message telling you that session_start has failed for this reason. – Algoman Feb 02 '16 at 20:51

8 Answers

26

Indeed, the BOM is actual data sent to the browser. The browser will happily ignore it, but still you cannot send headers then.
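To see whether stray output such as a BOM has already closed the header phase, PHP's `headers_sent()` can report where the output started. Here is a minimal sketch; `header_state` is a hypothetical helper name, and the exact result depends on the SAPI and on the output buffering settings.

```php
<?php
// Hypothetical helper: describe whether headers can still be sent.
function header_state(): string {
    // headers_sent() fills $file and $line with the place output began.
    if (headers_sent($file, $line)) {
        return "output already started at $file:$line";
    }
    return "headers can still be sent";
}

// Call this before session_start() or header(). If a BOM sits in front of
// the opening tag of a script, $line is typically reported as 1.
error_log(header_state());
```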

I believe the problem really is your and your friend's editor settings. Without a BOM, your friend's editor may not automatically recognize the file as UTF-8. He can try to set up his editor such that the editor expects a file to be in UTF-8 (if you use a real IDE such as NetBeans, then this can even be made a project setting that you can transfer along with the code).

An alternative is to try some tricks: some editors try to determine the encoding using some heuristics based on the entered text. You could try to start each file with

<?php //Úτƒ-8 encoded

and maybe the heuristic will get it. There's probably better stuff to put there, and you can either google for what kind of encoding detection heuristics are common, or just try some out :-)

All in all, I recommend just fixing the editor settings.

Oh wait, I misread the last part: if you're distributing the code widely, I guess you're safest making all files contain only 7-bit characters, i.e. plain ASCII, or just accepting that some people with ancient editors will see your name written funny. There is no fail-safe way. The BOM is definitely bad because of the "headers already sent" problem. On the other hand, as long as you only put non-ASCII characters in comments and the like, the only impact of an editor misunderstanding the encoding is weird characters. I'd go for spelling your name correctly and adding a comment targeted at heuristics so that most editors will get it, but there will always be people who see bogus characters instead.
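If you go the plain-ASCII route, a quick check can confirm that a file's contents really contain nothing above 0x7F. This is a minimal sketch; `is_plain_ascii` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: true if $bytes contains only 7-bit ASCII (0x00-0x7F).
function is_plain_ascii(string $bytes): bool {
    // No /u modifier, so the pattern is matched byte by byte.
    return preg_match('/[^\x00-\x7F]/', $bytes) === 0;
}

var_dump(is_plain_ascii("@author treznik"));   // bool(true)
var_dump(is_plain_ascii("\xEF\xBB\xBF<?php")); // bool(false): BOM bytes
var_dump(is_plain_ascii("@author \xC8\x99"));  // bool(false): UTF-8 bytes of "ș"
```

In practice you would pass it the result of `file_get_contents()` for each file you plan to ship.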

skrebbel
  • Thanks for the advice. I understand where I stand, and rather than rely on encoding detection heuristics, which is a kinda weird compromise, I'll make the decent choice and just spell my name with an "s" instead of a "ș"; most of the possible coders don't even have that character in their language anyway. Right? :) – treznik Apr 01 '10 at 14:39
  • 4
    Browsers don't ignore the BOM. And these errors are hard to track. Never save PHP files with BOM. – hakre Aug 09 '11 at 11:52
  • No, because it's not a bug. The BOM is an abomination, don't use it. – skrebbel Mar 20 '18 at 10:53
  • 1
    It most certainly is a bug. PHP could easily "re-flow" it at the end of the header phase. There are many good reasons for BOMs, including the fact that despite having the technical means to store content encodings out-of-band alongside their files (inc. xattr/windows ADS) nothing really does so, so... we kinda NEED in-band methods, like BOMs and the TRUE abomination of . Also, it's just a magic number, like many encodings/file formats before it. – DimeCadmium Jun 18 '18 at 23:04
17

A BOM in front of the opening <?php tag is sent as output, which causes the "headers already sent" error, so you can't use a BOM in PHP files.

Your Common Sense
11

This is an old post and has already been answered, but I can leave you some other resources that I found when I ran into this BOM issue.

http://people.w3.org/rishida/utils/bomtester/index.php with this page you can check if a specific file contains BOM.

There is also a handy script that prints every file with a BOM in your current directory and its subdirectories.

<?php
// Return true if the file starts with the UTF-8 BOM (EF BB BF).
function fopen_utf8($filename) {
    $file = @fopen($filename, "rb");
    if ($file === false) {
        return false; // unreadable file: treat as "no BOM"
    }
    $bom = fread($file, 3);
    fclose($file);
    return $bom === "\xEF\xBB\xBF";
}

// Walk $path, print every file that starts with a BOM, and return their paths.
// $exclude is a "|"-separated list of names to skip.
function file_array($path, $exclude = ".|..|design", $recursive = true) {
    $path = rtrim($path, "/") . "/";
    $folder_handle = opendir($path);
    if ($folder_handle === false) {
        return array();
    }
    $exclude_array = explode("|", $exclude);
    $result = array();
    while (false !== ($filename = readdir($folder_handle))) {
        if (in_array(strtolower($filename), $exclude_array)) {
            continue;
        }
        if (is_dir($path . $filename)) {
            // Need to include the full path or it's an infinite loop
            if ($recursive) {
                $result = array_merge($result, file_array($path . $filename, $exclude, true));
            }
        } elseif (fopen_utf8($path . $filename)) {
            $result[] = $path . $filename;
            echo $path . $filename . "<br>";
        }
    }
    closedir($folder_handle);
    return $result;
}

$files = file_array(".");
?>

I found that code at php.net

Dreamweaver also helps with this: it gives you the option to save the file without the BOM.

It's a late answer, but I still hope it helps. Bye

Moak
omabena
9

Just so you know, there's an option in PHP, zend.multibyte, which allows PHP to read files with a BOM without giving the "headers already sent" error.

From the php.ini file:

; If enabled, scripts may be written in encodings that are incompatible with
; the scanner.  CP936, Big5, CP949 and Shift_JIS are the examples of such
; encodings.  To use this feature, mbstring extension must be enabled.
; Default: Off
;zend.multibyte = Off
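
To find out from a script whether the directive is currently switched on, something like the following sketch should work. `zend_multibyte_enabled` is a hypothetical helper name; the setting itself still has to be made in php.ini as quoted above.

```php
<?php
// Hypothetical helper: true if the zend.multibyte directive is switched on.
function zend_multibyte_enabled(): bool {
    // ini_get() returns the current value as a string, or false if unknown.
    return filter_var(ini_get('zend.multibyte'), FILTER_VALIDATE_BOOLEAN);
}

echo zend_multibyte_enabled()
    ? "zend.multibyte is on\n"
    : "zend.multibyte is off\n";
```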
solarc
4

In PHP, in addition to the "headers already sent" error, the presence of a BOM can also screw up the HTML in the browser in more subtle ways.

See Display problems caused by the UTF-8 BOM for an outline of the problem with some focus on PHP (W3C Internationalization).

When this occurs, not only is there usually a noticeable space at the top of the rendered page, but if you inspect the HTML in Firefox or Chrome, you may notice that the head section is empty and its elements appear to be in the body.

Of course viewing source will show everything where it was inserted, but the browser is interpreting it as body content (text) and inserting it there into the Document Object Model (DOM).

hakre
matthewv789
2

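Or you could activate output buffering in php.ini (the output_buffering directive), which avoids the "headers already sent" error: the BOM sits in the buffer instead of going out immediately, so headers can still be sent. The BOM still ends up at the start of the response body, though. Output buffering can also help performance if your site has significant load.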
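Here is a minimal sketch of the same idea done in code with `ob_start()` rather than php.ini. The echoed BOM is a stand-in for what a badly saved include file would emit.

```php
<?php
// Start buffering before anything can be printed. Setting output_buffering
// in php.ini has the same effect without this call.
ob_start();

// Stand-in for the BOM that a badly saved include file would emit.
echo "\xEF\xBB\xBF";

// Output is still held in the buffer, so header() calls made here are not
// blocked by the stray bytes.
echo "<!DOCTYPE html>";

// Collect the buffered body; the BOM is still its first three bytes.
$body = ob_get_clean();
var_dump(strncmp($body, "\xEF\xBB\xBF", 3) === 0); // bool(true)
```

So buffering hides the header error but does not remove the BOM from the output.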

bobflux
2

The BOM is actually the most efficient way of identifying a UTF-8 file, and both modern browsers and standards support and encourage its use in HTTP response bodies.

In the case of PHP files, it's not the file but the generated output that gets sent as the response, so obviously it's not a good idea to save all PHP files with the BOM at the beginning, but that doesn't mean you shouldn't use the BOM in your response.

You can in fact safely inject the following code right before your doctype declaration (in case you are generating HTML as response):

<?="\u{FEFF}"?> (or before PHP 7.0.0: <?="\xEF\xBB\xBF"?>)

For further reading: https://www.w3.org/International/questions/qa-byte-order-mark#transcoding
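
If the response body is assembled as a string first, the same idea can be wrapped in a small function that never emits the BOM twice. This is a sketch only; `with_bom` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: prefix $html with exactly one UTF-8 BOM.
function with_bom(string $html): string {
    if (strncmp($html, "\xEF\xBB\xBF", 3) === 0) {
        return $html; // already starts with a BOM, don't add a second one
    }
    return "\xEF\xBB\xBF" . $html;
}

$page = with_bom("<!DOCTYPE html><title>Demo</title>");
echo bin2hex(substr($page, 0, 3)), "\n"; // efbbbf
```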

hakre
Szabolcs Páll
0

Adding to @omabena's answer: use this code to locate and remove the BOM from your files. Be sure to back up your files first, just in case.

// Return true if the file starts with the UTF-8 BOM (EF BB BF).
function fopen_utf8($filename) {
    $file = @fopen($filename, "rb");
    if ($file === false) {
        return false; // unreadable file: treat as "no BOM"
    }
    $bom = fread($file, 3);
    fclose($file);
    return $bom === "\xEF\xBB\xBF";
}

// Walk $path, print every file that starts with a BOM, and strip the BOM in place.
// $exclude is a "|"-separated list of names to skip.
function file_array($path, $exclude = ".|..|design", $recursive = true) {
    $path = rtrim($path, "/") . "/";
    $folder_handle = opendir($path);
    if ($folder_handle === false) {
        return array();
    }
    $exclude_array = explode("|", $exclude);
    $result = array();
    while (false !== ($filename = readdir($folder_handle))) {
        if (in_array(strtolower($filename), $exclude_array)) {
            continue;
        }
        $pathname = $path . $filename;
        if (is_dir($pathname)) {
            // Need to include the full path or it's an infinite loop
            if ($recursive) {
                $result = array_merge($result, file_array($pathname, $exclude, true));
            }
        } elseif (fopen_utf8($pathname)) {
            echo $pathname . "<br>";
            $contents = file_get_contents($pathname);
            // fopen_utf8() already confirmed the BOM, so drop the first three bytes.
            if ($contents !== false && file_put_contents($pathname, substr($contents, 3)) !== false) {
                $result[] = $pathname;
                printf("%s BOM removed.<br/>\n", $pathname);
            }
        }
    }
    closedir($folder_handle);
    return $result;
}

$files = file_array(".");
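
Before running this on real files, the stripping step can be tried on plain strings. Here is a minimal sketch; `strip_utf8_bom` is a hypothetical helper name, not part of the script above.

```php
<?php
// Hypothetical helper: return $bytes without a leading UTF-8 BOM, if any.
function strip_utf8_bom(string $bytes): string {
    if (strncmp($bytes, "\xEF\xBB\xBF", 3) === 0) {
        // Cast guards against older PHP versions returning false for an empty tail.
        return (string) substr($bytes, 3);
    }
    return $bytes;
}

echo strip_utf8_bom("\xEF\xBB\xBFhello"), "\n"; // hello
echo strip_utf8_bom("hello"), "\n";             // hello
```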
energy2080