Good day!

I think I read almost all the questions related to PHP and BOM, still I did not find a suitable answer to my problem. So here I am:

I have a PHP script (loader.php). The first time it is run, it generates a configuration file (_config.php), in which I just store some variables concerning the environment of the first call. If the _config.php file already exists, I require it in loader.php.

Everything works fine but the problem is that _config.php needs to be created as UTF8. The only way it worked for me, wrt this question, was with

```php
file_put_contents(
    $folder,
    "\xEF\xBB\xBF".$phpCommands
);
```

Of course this adds the BOM, and the second time loader.php is called the BOM is emitted as output when the file is pulled in via require, which produces the well-known extra space at the beginning of the page. I tried to remove it from the final page output using the method suggested here, but it has no effect, probably because the BOM is inserted via require and not via fopen or similar.

All my PHP scripts are UTF-8 (without BOM). The generated _config.php is UTF-8 "with BOM".

To solve the problem I have two solutions but I can't figure out how to make them work:

  1. Create a UTF8-encoded file without BOM (streams, iconv is not an option because of old PHP)
  2. require_once the file removing the BOM

Can someone help me out?

Please don't suggest alternative strategies to generate/store the configuration. It has to be done this way.

Thanks a lot!

Fabbio
  • `"\xEF\xBB\xBF"` doesn't change the encoding of `$phpCommands`, just tells that the file should be treated as UTF-8. So `$phpCommands` must already be encoded in UTF-8 and if you are going to remove the BOM anyway, the solution is to not add it in the first place. – Fabian Schmengler Aug 13 '15 at 09:01
  • PHP has no concept of character encoding, it treats all strings as arrays of bytes. If your `$phpCommands` string already is UTF-8-encoded text, you don't need the BOM. If it isn't, adding BOM won't magically make it so. – lafor Aug 13 '15 at 09:06
  • Thanks for your replies! Well, if I try `file_put_contents($folder, utf8_encode($phpCommands));` the resulting file has no encoding (at least none that I can see with Notepad++), whereas when I add the BOM it is recognized as UTF-8. @AD7six: I mean I have to generate the config file. One might argue that I can store this in a DB or something else. – Fabbio Aug 13 '15 at 09:29
  • You may want to read this: [What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text](http://kunststube.net/encoding/) – deceze Aug 13 '15 at 09:32

1 Answer

Simply do not add the BOM when creating the file. It serves no purpose as such.
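
As a minimal sketch of what that looks like in practice: the path and the contents of `$phpCommands` below are made-up stand-ins for the question's variables, and the string is assumed to already hold UTF-8 bytes (the non-ASCII character is spelled out as escape sequences so the example does not depend on how this snippet itself is saved).

```php
<?php
// Hypothetical stand-ins for the question's variables.
$configPath  = sys_get_temp_dir() . '/_config.php';
$phpCommands = "<?php\n\$siteName = 'Caf\xC3\xA9';\n"; // "é" given as its UTF-8 bytes

// Write the bytes exactly as they are: no "\xEF\xBB\xBF" prefix.
file_put_contents($configPath, $phpCommands);

// The file now starts directly with "<?php", so requiring it
// sends nothing to the output before the opening tag.
ob_start();
require $configPath;
$output = ob_get_clean();

var_dump($output === '');      // bool(true)
var_dump(bin2hex($siteName));  // string(10) "436166c3a9"
```

The output buffering is only there to make the check visible; in loader.php a plain `require` would be enough.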

The most likely explanation for your "the only way it worked for me" is simply a bad testing method, no more, no less. Meaning, your file was being created with UTF-8 perfectly fine, just whatever method you used to confirm that was flawed. I'll guess here that you opened the resulting file in some text editor, and that editor told you the file encoding was "ANSI" or "ASCII" or such.

Well, a plain text file does not declare its encoding anywhere. Your text editor was just taking a guess as to its encoding. If the file contents are just plain English/ASCII, there's no difference whatsoever between ASCII, ANSI and UTF-8. Your text editor simply told you one of the possible answers, where any answer is equally valid. Adding a BOM explicitly places a hint at the beginning of the file that the encoding is UTF-8, which the editor picked up on.
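
To make that concrete, here is a small sketch (the sample string is invented for illustration). It checks that an ASCII-only string contains no byte above 0x7F, that the same bytes are valid UTF-8, and that the BOM is nothing more than three extra bytes in front.

```php
<?php
$ascii = "<?php \$debug = true;\n"; // plain ASCII only
$bom   = "\xEF\xBB\xBF";            // the UTF-8 byte order mark

// Bytes below 0x80 mean the same thing in ASCII, in the common
// "ANSI" code pages and in UTF-8, so there is nothing to tell apart.
$isPureAscii = !preg_match('/[\x80-\xFF]/', $ascii);

// With the /u modifier PCRE validates the subject as UTF-8;
// preg_match() returns 1 here because the bytes are valid.
$isValidUtf8 = preg_match('//u', $ascii) === 1;

// The BOM changes none of the existing bytes, it only adds a prefix.
$withBom = $bom . $ascii;

var_dump($isPureAscii);                         // bool(true)
var_dump($isValidUtf8);                         // bool(true)
echo bin2hex(substr($withBom, 0, 3)), "\n";     // efbbbf
echo strlen($withBom) - strlen($ascii), "\n";   // 3
```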

This, or something similar, is likely your entire non-issue.

deceze
  • bad testing method it was... Thanks for the help! – Fabbio Aug 13 '15 at 10:06
  • @Fabbio Tip: the only testing method that makes sense with regards to encodings is to assert that a given file is valid in a certain encoding. I.e., never try to *figure out* what encoding a file is in, only *test* whether it's in the encoding you think it is. For example, using iconv on the command line: `iconv -f UTF-8 file.txt` – if that works without errors, your file is UTF-8 encoded. – deceze Aug 13 '15 at 10:09
  • Ok, but wait a second: now if I manually insert a non-ASCII character (say à) in the _config.php file, it won't be rendered correctly in *loader.php* (question mark). Do you mean it's the editor that sets/changes the encoding when I save the file? – Fabbio Aug 13 '15 at 13:59
  • The editor decides which encoding to save the file in; in other words what bytes to dump into the file. If your editor decides to save the file as, say, ISO-8859-1 ("Latin-1"), then your file won't be UTF-8 encoded. It may have been before, but it's not after your editor decides to save it as something else. That, or you're not handling the encoding correctly in your application and that's where it screws up. – deceze Aug 13 '15 at 15:04
  • Ok, the reason why I added the BOM was that this way the editor recognized the text in the file as UTF-8... Now I tried to save "ààà".$phpCommands in _config.php and it is correctly displayed in the page. And magically Notepad++ identifies the file as UTF-8 without BOM. So my guess here is that as soon as there is a non-ASCII char, the editor guesses UTF-8. The trick to make it work with Notepad++ is probably to add a non-ASCII character to the text file. – Fabbio Aug 14 '15 at 05:42
  • Or simply tell your editor that you'd like to treat the file as UTF-8 if it doesn't pick it up itself. You shouldn't have to bend over backwards to satisfy the needs of your tools. If a file is encoded in UTF-8 it's encoded in UTF-8, no matter what some text editor says. Your deduction about it being able to better guess the encoding when there are unambiguous characters (non-ASCII) in there is correct. – deceze Aug 14 '15 at 06:35
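
The "test, don't guess" advice from the comments can also be followed from PHP itself. The sketch below is not part of the original answer: `isValidUtf8()` and `stripUtf8Bom()` are hypothetical helper names, and the validity check relies on `preg_match()` returning `false` for malformed UTF-8 when the `/u` modifier is used. The second helper covers the case where a file was already written with a BOM: read it, strip the three bytes, and write it back once instead of trying to filter the output of `require`.

```php
<?php
// Hypothetical helper: true if $bytes is well-formed UTF-8.
function isValidUtf8($bytes)
{
    // With /u, preg_match() returns false when the subject is malformed UTF-8.
    return preg_match('//u', $bytes) === 1;
}

// Hypothetical helper: drops a leading UTF-8 BOM if there is one.
function stripUtf8Bom($bytes)
{
    if (strncmp($bytes, "\xEF\xBB\xBF", 3) === 0) {
        return substr($bytes, 3);
    }
    return $bytes;
}

$utf8   = "\xC3\xA0\xC3\xA0\xC3\xA0"; // "ààà" encoded as UTF-8
$latin1 = "\xE0\xE0\xE0";             // "ààà" encoded as ISO-8859-1

var_dump(isValidUtf8($utf8));    // bool(true)
var_dump(isValidUtf8($latin1));  // bool(false)
var_dump(stripUtf8Bom("\xEF\xBB\xBF<?php") === '<?php'); // bool(true)
```

The Latin-1 bytes fail the check for the reason given above: the editor chose different bytes for the same characters, and those bytes do not form a valid UTF-8 sequence.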