PHP source code in UTF-8 files; how to interpret properly?

Question

I build tools to analyze source code. Such tools have to read the source code files correctly, especially as regards character encodings. For example, "What is the precise string of bytes in a string literal?" (both PHP literals, and HTML text).

My perhaps erroneous understanding is that PHP source files are 8-bit character only (that is, the PHP engine reads them that way [right?], since they are only supposed to contain 8-bit characters). But 8-bit characters in which encoding? I presume they are intended to match ISO-8859-1 (or -x?) [can somebody quote chapter and verse?]. That is, an umlaut is intended to be an umlaut, right? Following this, one can write PHP scripts with HTML and strings for most European nations/character sets straightforwardly.

But it is clear this is problematic with Unicode. As far as I can tell, most PHP applications deal with Unicode essentially by having strings containing UTF-8 byte sequences which can be inserted in 8-bit PHP strings. Following this, one can generate scripts whose HTML contains Unicode UTF-8 sequences, if you tell your server you are generating UTF-8 text.

For the above situations, one can read the PHP file as 8-bit character text, and this seems to me to match the language.

What puzzles me are PHP source files encoded as UTF-8 (the Joomla package has ~1800 source files, of which some 10 are UTF-8 and the rest are not). Any (non-ASCII) European characters that show correctly in a UTF-8 rendering are actually encoded as multibyte sequences. I suppose such pages served as UTF-8 will have the HTML rendered correctly. But any string comparisons for European characters or other Unicode characters that apparently render correctly in a text editor simply won't work. And string literals will not contain what they appear to contain. Do programmers use UTF-8 files because that's what editors offer? Are they doing this on purpose? Or is it just an accident that doesn't matter for most work?

So, how should one read a PHP source file? (In particular, in what character encoding?) One possible answer is: always as ISO-8859-1 8-bit codes, regardless of the actual content or BOMs (I see a lot of UTF-8 BOM-marked PHP files). Another answer is: as UTF-8, if so marked.

[Our tools read and write arbitrary encodings. A "trivial" tool is read-file-in-one-character-encoding, write identical code points in another encoding. Reading UTF-8 PHP files that way gets us into trouble when writing ISO8859-1 equivalent files, because many UTF-8 code points (e.g., the euro symbol) cannot be encoded in ISO8859-1.]
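The lossy conversion described in this aside can be checked mechanically. A minimal PHP sketch, assuming the mbstring extension is loaded; `survivesLatin1` is a hypothetical helper name, and byte escapes are used so the example does not depend on how the example file itself is saved:

```php
<?php
// Hypothetical helper: does a UTF-8 string survive a round trip through ISO-8859-1?
function survivesLatin1(string $utf8): bool
{
    // Code points missing from ISO-8859-1 are replaced by mbstring's substitute character.
    $latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
    $back   = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
    return $back === $utf8;
}

$uUmlaut = "\xC3\xBC";     // U+00FC, present in ISO-8859-1
$euro    = "\xE2\x82\xAC"; // U+20AC, absent from ISO-8859-1

var_dump(survivesLatin1($uUmlaut)); // bool(true)
var_dump(survivesLatin1($euro));    // bool(false)
```

A tool could run a check like this before re-encoding a file and refuse (or keep the original encoding) when the round trip is not exact.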

EDIT Aug 30: We now check PHP files to see if they have UTF-8 BOMs, or appear to have UTF-8 sequences that are all legal. In either of these cases, we read the file as UTF-8; otherwise we read it as ISO8859-1 by default. We now preserve the file encoding if we modify it. (Getting all this right is quite a lot of work.) This seems to be a safe strategy, but it may differ from what PHP programmers expect.
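The detection strategy in this edit can be sketched roughly as follows. This is an illustration under assumptions, not our tool's actual code: `guessSourceEncoding` and its return labels are hypothetical, and the separate `ascii` case is an addition of the sketch.

```php
<?php
// Hypothetical helper mirroring the strategy above: BOM first, then UTF-8 validity,
// then an ISO-8859-1 default. The return labels are this sketch's own.
function guessSourceEncoding(string $bytes): string
{
    if (strncmp($bytes, "\xEF\xBB\xBF", 3) === 0) {
        return 'utf-8-bom';
    }
    // No byte above 0x7F: the file reads the same in ASCII, ISO-8859-x and UTF-8.
    if (preg_match('/[\x80-\xFF]/', $bytes) !== 1) {
        return 'ascii';
    }
    // With the /u modifier PCRE rejects subjects that are not valid UTF-8,
    // so an empty pattern doubles as a validity check.
    if (preg_match('//u', $bytes) === 1) {
        return 'utf-8';
    }
    return 'iso-8859-1';
}

echo guessSourceEncoding("\xEF\xBB\xBF<?php echo 1;"), "\n"; // utf-8-bom
echo guessSourceEncoding("<?php echo 'abc';"), "\n";         // ascii
echo guessSourceEncoding("<?php echo '\xC3\xBC';"), "\n";    // utf-8
echo guessSourceEncoding("<?php echo '\xFC';"), "\n";        // iso-8859-1
```

This remains a guess: ISO-8859-1 text whose high bytes happen to form legal UTF-8 sequences would be classified as UTF-8.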

What's the specific question here? I'm not sure what you're asking, specifically. — Anderson Green, Aug 03 '13 at 23:40
I've bolded the specific question. If I call a modern library facility to open a file which is a PHP source file, what encoding do I tell that library the file uses? Or, what sets of rules do I follow, to figure out what the encoding of the file actually is? If you have never noticed the character encoding of the PHP file you edited, this question won't make any sense to you. — Ira Baxter, Aug 03 '13 at 23:58
Ira, regarding the deletion, I asked about it on Meta: http://meta.stackexchange.com/questions/195362/is-the-roomba-broken . The system auto-deletes unanswered downvoted questions after 30 days, which I thought only applied to closed questions. That's what happened here. — Brad Larson, Aug 31 '13 at 19:12
@BradLarson: Thanks for stepping in. I guess the auto-delete scheme doesn't serve questions like this well; it is a less useful question just because it isn't answered (yet)? The downvote here IMHO isn't justified; I'm guessing it is some PHP-lover that hates the idea that this is a messy question. And I was just considering a bounty when it got deleted. — Ira Baxter, Sep 01 '13 at 03:34
@IraBaxter You tell it to open the file as a binary. PHP files are - in first approximation - binary. Do not try to read them in some particular encoding. *Especially* don't try to read them as UTF-8. Don't strip BOMs. PHP strings are definitely binary. Trying to read them as UTF-8 will break things. (Honestly, your post reads like you're totally confused here and I still don't entirely understand your problem. Reading PHP files is easy as it doesn't need encoding handling. Maybe you're coming from a Python background where the idea of not specifying encodings all over the place is foreign?) — NikiC, Sep 04 '13 at 10:24
What people don't seem to understand is there is no such thing as *no encoding*. The "encoding" tells you how characters are represented (7 bits? 8 bits? 16 bits? multibyte?) and for each code point, what character it represents. PHP appears, after this discussion, to have no *standard* encoding, but rather a peculiar one of its very own: an 8 bit representation, with codes 0x0A,0x0D,0x20-0x7E having characters matching those of ASCII. ... — Ira Baxter, Sep 04 '13 at 14:02
.... codes 0x7F and up harder to interpret. Just what exactly does 0xDF represent? Is it a German "strasse" (as in ISO8859-1) or extended equal sign (as in ISO8859-8) or something else (some other of zillion existing encodings)? Is it a letter? If a PHP identifier is made of letters and digits, is this character allowed in a PHP identifier or not? I suspect the truth is that the interpretation of codes above 0x7E is determined only by the Zend-PHP lexer's regular expressions. You can say PHP doesn't actually define it (but the regexes classify them). Hard to write a PHP program for Europe. — Ira Baxter, Sep 04 '13 at 14:05
@NikiC: I come from a background where people designing the programming languages paid attention to character set issues. Our tools process dozens of languages. With the notable exception of PHP, all the modern versions are clear about character sets (Python, Java, even old languages such as C and C++). PHP stands out in this crowd. How am I to write a static analyzer, that can determine if a PHP program matches a single input keystroke, shown as a Strasse in my favorite editor? How can I write a more sophisticated analyzer if I can't get this right? Yes, I am professionally confused. — Ira Baxter, Sep 04 '13 at 14:13
@IraBaxter I really do mean "no encoding" when it comes to strings. A 0xDF in a PHP string has no intrinsic meaning: It's just the byte 0xDF. PHP does not care what this represents. It can just as well be a random byte for an IV as it can be a German "strasse" (whatever that is). PHP strings are byte-arrays, no more than that. For source code analysis you should handle them as such. If you wish to output them to the user, you can only guess at how to present them (UTF-8 is common in *professional* settings.) — NikiC, Sep 04 '13 at 18:35
@IraBaxter PHP is totally clear about character sets, in the same way that C and C++ are clear about them. The very first sentence of the strings documentation tells you so, and expands on things here: http://de1.php.net/manual/en/language.types.string.php#language.types.string.details It also mentions Zend Multibyte which - in the very rare case that somebody uses it - can be used to change the direct mapping between bytes in the source code and the strings resulting from it. — NikiC, Sep 04 '13 at 18:41
@IraBaxter (Again assuming no Zend Multibyte) Regarding keywords (like `array`), those are compared according to ASCII (7 bit). Identifiers are matched on bytes, just like everything else (the byte ranges for them are documented). If you wonder which rules are used for case-insensitive matching (e.g. keywords again), you can find the exact to-lower mapping here: http://lxr.php.net/xref/PHP_TRUNK/Zend/zend_operators.c#45 (locale-insensitive). Anyway, I feel like this is a bit too much for the comments section ^^ — NikiC, Sep 04 '13 at 18:46
Why is this being downvoted? It is clearly a serious, confusing issue about PHP. Even Jon Skeet is surprised at the answer :-} — Ira Baxter, Feb 11 '15 at 01:22

boen_robot · Accepted Answer · 2021-04-23T14:15:45.457

TL;DR

ASCII

Until PHP 5.4, the PHP interpreter didn't at all care about the charset of PHP files, as evidenced by the fact that the zend.script_encoding ini directive only appeared in that version. It always treated it as ASCII basically.

When PHP needs to identify, for example, a function name, that happens to contain characters beyond ASCII-7bit (well, any labeled entity with any label really, but you get my point...), it merely looks for a function in the symbol table with the same byte sequence - an umlaut (or whatever...) written in one way would be treated differently than an umlaut written in another way. Try it. For backwards compatibility, if zend.script_encoding is not set, this is still the default behavior. Also take note of the regex showing what is a valid identifier, which you can see is charset neutral (well... except latin letters, which are in the ASCII-7bit range), but shows you bytes instead.
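To make the byte-wise matching concrete, here is a small sketch. The label pattern is the one given in the PHP manual (older editions list `\x7f-\xff` for the high range); the variable names are only illustrative, and byte escapes stand in for the two ways an editor might save "tüst":

```php
<?php
// Label pattern as given in the PHP manual (older manual editions list \x7f-\xff).
$label = '/^[a-zA-Z_\x80-\xff][a-zA-Z0-9_\x80-\xff]*$/';

$utf8Name   = "t\xC3\xBCst"; // "tüst" saved as UTF-8
$latin1Name = "t\xFCst";     // "tüst" saved as ISO-8859-1

var_dump(preg_match($label, $utf8Name));   // int(1)
var_dump(preg_match($label, $latin1Name)); // int(1)
var_dump($utf8Name === $latin1Name);       // bool(false)
echo bin2hex($utf8Name), ' vs ', bin2hex($latin1Name), "\n"; // 74c3bc7374 vs 74fc7374

// Variable variables go through the symbol table, so the two spellings name two variables.
${$utf8Name}   = 'defined under the UTF-8 spelling';
${$latin1Name} = 'defined under the ISO-8859-1 spelling';
var_dump(${$utf8Name} === ${$latin1Name}); // bool(false)
```

Both spellings are acceptable labels, yet they never refer to the same symbol, which is the behavior an analyzer has to reproduce.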

This leads us also to the declare(encoding) construct. If you see THAT in a file, that's the definitive charset to honor for that particular file (ONLY). Use something else until you encounter one, and if you see more than one - honor the second one after its declare statement.
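For a static tool, spotting the construct could look roughly like this. It is a regex heuristic only, `findDeclaredEncoding` is a hypothetical name, and it reports just the first declaration it sees rather than tracking which one is in effect at a given point in the file:

```php
<?php
// Hypothetical helper: return the first encoding named in a declare(encoding=...)
// statement, or null. Regex heuristic only: it does not tokenize, so a match
// inside a comment or string literal would be reported too.
function findDeclaredEncoding(string $bytes): ?string
{
    $pattern = '/\bdeclare\s*\(\s*encoding\s*=\s*([\'"])([^\'"]+)\1\s*\)/i';
    return preg_match($pattern, $bytes, $m) === 1 ? $m[2] : null;
}

var_dump(findDeclaredEncoding("<?php\ndeclare(encoding='ISO-8859-1');\necho 'x';"));
// string(10) "ISO-8859-1"
var_dump(findDeclaredEncoding("<?php\necho 'x';")); // NULL
```

A real analyzer would more likely find the declaration in its token stream; the sketch only shows what to look for.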

If there's none...

In a static context (i.e. when you don't know the effective ini settings), you'd need to fall back to something else (something that's user defined, ideally) when the charset is important, or otherwise just treat characters beyond ASCII-7bit as pure binary, and display them in some uniform code-point-like fashion.

In a dynamic context (e.g. if you could for example rename the file for a moment, create a temporary file at that place, with that name; have it echo the value of zend.script_encoding; restore back the normal file), you should use the zend.script_encoding value if available, and fall back to something else (just as in a static context) otherwise.

The same treatment applies to strings, HTML fragments and any other contents of a PHP file - it's just read as a binary string, except certain ASCII characters (i.e. bytes) that are important to the PHP lexer, such as the sequence "<?php" (notice that all are ASCII characters...); an apostrophe within a single quoted string; etc. - The interpreter itself doesn't care about a string's charset, and if you must display a string's contents on screen, you should use the above means to figure out the best way to do so.

Edge cases (requested in comments):

1.

Is there a restriction on what encoding are allowed?

There doesn't seem to be any list of allowed encodings anywhere, or at least I can't find one. Given that this is the successor of the --enable-zend-multibyte compile setting, UTF encodings of all flavors are sure to be in that list. Even if other (ANSI) encodings don't have an effect on PHP itself, that shouldn't deter you from using that value as a hint.

2.

How does "declare(encoding)" work if the source file is UTF-16 (null 8 bit bytes between 8 bit ascii chars for the declaration)?

zend.script_encoding is used until a declare(encoding) is encountered. If it's not set, ASCII is assumed. ~~This shouldn't be a problem even in a UTF-16 file... right? (I don't use UTF-16)~~ While this may be a problem for PHP files encoded as UTF-16, I think it's fair to say the vast majority of developers just don't encode their scripts in UTF-16. Their data, sure, if the application's case calls for it. But not the script itself. Most PHP files in the wild are encoded either with an ANSI encoding or UTF-8.

3.

If the .ini or the file setting is UTF-8 or otherwise, then identifiers are presumably taken only from code points in range x41-xFF, but not from code points x100 up?

I haven't tried supplying invalid UTF-8 bytes to tell you the answer to that one, nor does the manual ever state anything on the question. I would assume that PHP execution will fail with a parse error on that. Or at least it should. As far as your tool is concerned, it should report the invalid UTF-8 sequence anyway, since even if PHP allows it, that's still a QA problem.

4.

For UTF encodings, are characters in strings represented as their UTF code point (that makes no sense since PHP strings seem only have 8 bit characters)?

No. Characters in strings and non-PHP content are still treated as just a sequence of bytes, which you can confirm by looking at the output of strlen(), and seeing how it differs from mb_strlen(), which is the one that respects encoding (well... it respects the mbstring.internal_encoding setting to be exact, but still).
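A quick way to see the difference, assuming the mbstring extension is loaded (the encoding is passed explicitly here so the result does not depend on ini settings):

```php
<?php
$euro = "\xE2\x82\xAC"; // the three UTF-8 bytes of U+20AC (euro sign)

var_dump(strlen($euro));             // int(3): bytes
var_dump(mb_strlen($euro, 'UTF-8')); // int(1): code points, given an explicit encoding
var_dump(bin2hex($euro[0]));         // string(2) "e2": indexing yields a single byte
```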

5.

If not, what does it mean to set the encoding to UTF something?

AFAIK, it affects lookups in the symbol table. With UTF set, umlauts written in different ways, or in different UTF flavors that end up with the same UTF code points... they would all converge on the same symbol, as opposed to without declare(encoding), where byte-by-byte comparison is done instead. And I say "AFAIK" here, because frankly, I've never run such experiments myself... I'm a "do-goody 'everything-as-valid-UTF-8'-er".

Is there a restriction on what encoding are allowed? How does "declare(encoding)" work if the source file is UTF-16 (null 8 bit bytes between 8 bit ascii chars for the declaration)? If the .ini or the file setting is UTF-8 or otherwise, then identifiers are presumably taken only from code points in range x41-xFF, but not from code points x100 up? For UTF encodings, are characters in strings represented as their UTF code point (that makes no sense since PHP strings seem only have 8 bit characters)? If not, what does it mean to set the encoding to UTF something? — Ira Baxter, Sep 01 '13 at 22:29
@IraBaxter I've altered my answer above to try and answer these questions. — boen_robot, Sep 02 '13 at 07:38
"This shouldn't be a problem even in a UTF-16 file... right?" Wrong. You can't read UTF-16 as if it were monobyte, it's a two-byte (and four-byte) encoding — Walter Tross, Apr 12 '14 at 14:27

score 9 · Answer 2 · edited Jun 20 '20 at 09:12

As repeated already many times, PHP files do not have any encoding for bytes above x7f. All you can tell is that the bytes x00 to x7f are ascii.

A file with a BOM marker at the beginning is not valid PHP. So there is nothing like a PHP file in iso-8859-1 or utf-8. It is plain 8-bit.

A PHP file is not iso-8859-x, because those encodings do not contain all possible byte values. As you know x7f to x9f are not valid in iso-8859-1, but any PHP file can possibly contain them.

A PHP file is not utf-8 either, because it might contain invalid utf-8 sequences, without being invalid.

The big picture

Charset by convention at writing

A PHP file can have an encoding by convention, but this is up to the discretion of the programmer. He will tell his editor that a given project is in utf-8 or iso-8859-1 or whatever else.

But again, this is only a convention of the programmer. His editor is treating the PHP file as if it were in such and such encoding. The encoding merely serves the purpose of displaying the file in the editor and allowing the programmer to edit it.

No charset during compilation

As explained above, the compiler does not need to know the encoding the programmer assumed. The only thing that matters is what are the byte sequences in the file.

Implicit or explicit charset defined on consumption

PHP generates some data that is sent over the internet to the browser. At the time the browser displays the data, the encoding is definitely defined, but how?

The encoding can be defined in the HTTP header, like this Content-Type: text/html; charset=utf-8
It can be defined in the HTML output itself: <meta charset="utf-8">
Or if the charset is not defined explicitly, the browser makes an educated guess depending on the byte sequences present in the document (e.g. valid utf-8 sequences or BOM).
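The first two options can be sketched in PHP like this. `contentTypeHeader` is a hypothetical helper, and `utf-8` is just the charset assumed for the example:

```php
<?php
// Hypothetical helper that builds the header line; the charset label is whatever the project uses.
function contentTypeHeader(string $charset): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

$charset = 'utf-8';
if (!headers_sent()) {
    header(contentTypeHeader($charset)); // must run before any output is sent
}
echo '<meta charset="' . htmlspecialchars($charset) . '">', "\n";
```

Neither line changes the bytes PHP emits; both only tell the browser how to interpret them.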

Of course it is good practice that a PHP application never lets the browser choose, but there is no requirement that the encoding be defined anywhere.

More details

Normally, the encoding the programmer chooses will be the same which will be used at the end of the chain in the browser, and all strings in the PHP-files will use this same encoding.

But this need not be the case. There are valid reasons why it may not be. Let's look at examples:

Different languages, different encodings

I have used Joomla since version 1.0. In that version, each language file had its own encoding. The French language files were iso-8859-1, while the Arabic files were windows-1256 and the Russian files koi8-r. For those, encoding mattered, but not for all the other files, which could be treated equally as utf-8 or iso-8859-1. (Meanwhile, Joomla has switched to utf-8.)

Heterogeneous databases

One of our web applications connects to two different databases; one happens to be in utf-8, the other in windows-1252. This means that not all the strings in this project are in the same encoding. I use utf-8 as much as possible, but I need to translate the encodings back and forth using the mb_* group of functions in PHP.

PHP's conversion functions

The mere presence of the encoding conversion functions mb_convert_encoding, iconv, utf8_encode, etc. suggests that strings of different encodings can be present in the same project.
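As an illustration, here is the same windows-1252 byte converted to utf-8 with two of those functions. This is a sketch that assumes both the mbstring and iconv extensions are available:

```php
<?php
$cp1252Euro = "\x80"; // the euro sign as a single windows-1252 byte

$viaMb    = mb_convert_encoding($cp1252Euro, 'UTF-8', 'Windows-1252');
$viaIconv = iconv('Windows-1252', 'UTF-8', $cp1252Euro);

echo bin2hex($viaMb), "\n";    // e282ac
echo bin2hex($viaIconv), "\n"; // e282ac
```

PHP itself never knew that `"\x80"` was a euro sign; only the encoding names passed to the conversion functions carry that information.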

Good practice

Define your encoding and stick to it! The best choice is utf-8. If strings in other encodings are needed, you can always write something like $s=mb_convert_encoding('Уровень','ucs-2','utf8');

Here again: you cannot use BOM markers in PHP. The reason is simple: a BOM is a short byte sequence (three bytes, EF BB BF, for UTF-8) that comes before the opening tag <?php. These bytes are therefore sent to the browser as output. If one then tries to send a header(), an error is generated and the header is not sent.

Conclusion

In general, there is no need to determine the encoding of a PHP file. Only the encoding of the finally rendered HTML-file is important.
It is good practice to edit all files in the same encoding that is used to display the final results. But it really only matters for the language files (if you use any system of i18n at all).
While in practice all the strings in one file are in the same encoding, nothing would keep an ill-minded programmer from writing strings in different encodings in the same file and still getting a working program.

Finally, encoding in PHP is only a matter of the convention used at writing time and the charset used in the browser to render the page. In between, a PHP file has no specific encoding; it's just plain 8-bit.

Your answer is concise and well-written, and I was torn about awarding the bounty to you or boen_robot. He provided some alternative details about encoding, and responded earlier than you, so I have awarded him the bounty. I appreciate your response very much. — Ira Baxter, Sep 08 '13 at 03:09

score 3 · Answer 3 · edited May 23 '17 at 12:32

There is really no way to reliably tell a PHP source file's encoding. It could be anything really. As you know, the only generic identifier is the BOM, but most people will remove those from their source files as they can cause trouble at output time.

How to deal with this depends on what you want to do. Usually, it doesn't matter because the PHP file will take care of declaring its encoding itself, e.g. by sending a Content-type header (or it is defined implicitly, e.g. because it's part of a project whose convention it is to use a certain encoding). The issue of encoding doesn't really come up because the file sorts it out itself at execution time.

If you're building a tool that manipulates or analyzes PHP source files in some form, chances are the encoding doesn't really matter, but we'd have to know more about your situation to assess that.

The way most IDEs deal with this uncertainty is they ask the developer to manually specify which encoding the project, folder, and / or file are in. Maybe that is an option for you as well.

answered Aug 31 '13 at 19:03

Pekka

While the encoding *might* be anything, only one matters. How does the PHP interpreter deal with the file encoding? You can't tell the interpreter [can you?] how to interpret the file's encoding, and the execution interpretation is the only one that matters since it determines the semantics. And if the interpreter handles it in only one (non-UTF8) way, then when a UTF file is read and interpreted, it isn't interpreted according to the character codes seen in a UTF8 editor. Why doesn't this make everybody crazy? And how is it that Joomla seems to work with some files in UTF-8 encoding? – Ira Baxter Sep 01 '13 at 03:28
@Ira I don't think the PHP interpreter has reason to *care* about the encoding. Function/variable names and the like *can* exceed the ASCII range (`function ÄÖÜß()` is valid), but I guess PHP will simply interpret the bytes and not care about the encoding at all. So internally, the function name will be `C3 84 C3 96 C3 9C C3 9F` (the UTF-8 bytes that make up `ÄÖÜß`) to PHP and that's that. Where specifically do you see the need to care about the encoding in your project or whatever you're working on? – Pekka Sep 01 '13 at 03:38
Yes, I understand that the interpreter might pick up 3 byte sequences in identifier names without being bothered, if it read UTF-8 as ISO-8859-1. I'm principally concerned with strings and their content. – Ira Baxter Sep 01 '13 at 13:39
@Ira that will never happen. The operator characters are all within the first 128 (Single byte values). [Single bytes, leading bytes, and continuation bytes do not share values.](http://en.wikipedia.org/wiki/UTF-8) – Pekka Sep 01 '13 at 13:45
@Ira re your edited comment: The interpreter itself has no reason to worry about the strings' encoding. If PHP is fed a UTF-8 string, it will simply pass through the byte values as they are. Whether they are interpreted correctly at output time is the responsibility of the developer, and not the interpreter's concern. Basic PHP *functions* are not multibyte aware; that's why the `mb_` family of functions exists: http://us3.php.net/mbstring But long story short, you have no 100% reliable way of telling what encoding a bunch of strings is that sit in a PHP file. You will have to ask the user. – Pekka Sep 01 '13 at 13:47
Yes, I understand that the interpreter might pick up 3 byte sequences in identifier names without being bothered, if it read UTF-8 as ISO-8859-1. I'm principally concerned with strings and their content. As an example, if using a UTF-8 capable editor, one writes a PHP string containing the symbol "€" (Euro currency) code point 20ac, what does the interpreter think the string contains and why? [Joomla does this]. How is it that this character gets rendered correctly? What if the character is in the HTML part of the text? [ooops you responded awfully quickly] – Ira Baxter Sep 01 '13 at 13:50
@Ira the interpreter just stupidly reads the bytes and passes them through. It doesn't care about the encoding. The browser receives the bytes as they are and renders the page. If the encoding the browser is using for the page doesn't match the encoding the character is in, you see a broken character. Does that make it more clear? – Pekka Sep 01 '13 at 13:52
What you seem to be asserting is that the PHP interpreter *always* reads the file as 8 bit characters, which is rather what I expect. That means regardless of how they are encoded from the editor's point of view, they should be processed by a language analysis as if they were in an 8 bit encoding. OK, *which* 8 bit encoding? I suspect any of the ISO-8859-x encodings are fine as I doubt the PHP interpreter actually looks for any specific character code > 0x80. – Ira Baxter Sep 01 '13 at 13:54
@Ira maybe I'm misunderstanding (which is entirely possible), but I think you should be completely encoding-agnostic here - assume no encoding at all, and ask the user to specify one where needed. After all, I could have a `function tüst()` defined as UTF-8 (`74 C3 BC 73 74`) or as ISO-8859-1 (`74 FC 73 74`) and both would be perfectly fine function names to PHP. Which encoding is the "right" one matters only when *showing the function name to a human being*. For that, you will have to ask the user to specify an encoding, as there is no way to detect an encoding with total accuracy. – Pekka Sep 01 '13 at 14:06
...that said, you can of course guess by trying to sniff the encoding using something like http://www.php.net/manual/de/function.mb-detect-encoding.php#68607 (converted to whatever language your analysis tool is in, of course) but keep in mind this is extremely unreliable. – Pekka Sep 01 '13 at 14:08
I don't see how an interpreter *can* be encoding-agnostic. It needs to understand at least *some* characters - symbols for operators, for example. The difference between UTF-8 and UTF-16 would be very important there, for example. Imagine a sequence of bytes 0x00 0x3c 0x00 0x3d 0x00. Is that "<=" or "\0 < \0 ="? It depends on the encoding. – Jon Skeet Sep 01 '13 at 15:58
@Jon fair enough, didn't think beyond ISO-8859-1 (and other single-byte encodings where the first 127 characters are the same) and UTF-8. I guess the basic set of symbols PHP needs, it expects to be in the ASCII range. – Pekka Sep 01 '13 at 16:05
The symbol can be part of ASCII but not encoded in the same way as ASCII. Also, it would affect any functions which determine the length of a string etc... it really does make a huge difference, IMO. – Jon Skeet Sep 01 '13 at 16:10
@Jon the length of strings is a separate issue. That's what the [`mb_*` family of functions](http://www.php.net/manual/en/ref.mbstring.php) is there to work around. Apart from a convention which byte value the essential symbols need to be, I don't see how encoding plays into it here. – Pekka Sep 01 '13 at 16:21
@Pekka웃: Eek, so strings are really byte-sequences in PHP rather than text? Ick ick ick and sigh. – Jon Skeet Sep 01 '13 at 16:23
@Jon yeah, it's one of the great issues of the language. Native Unicode support is planned for V6, but nobody knows when that's going to come - if ever – Pekka Sep 01 '13 at 16:24
@Pekka웃, The engine definitely needs to assume an encoding. Take the string `f()` for example. When encoded in UTF-8, it is `66 28 29`, but when encoded in UTF-16BE it is `00 66 00 28 00 29`. The engine needs to know or guess an encoding to correctly decipher the bits in the parse phase. Same goes for the other symbols like `;`, `'`, `{`, `}`, `\ `, `->`, `=>`, etc etc. – Pacerier Oct 27 '14 at 00:47