I build tools to analyze source code. Such tools have to read source files correctly, especially with regard to character encodings. For example, they must be able to answer "What is the precise string of bytes in a string literal?" (for both PHP literals and HTML text).
My perhaps erroneous understanding is that PHP source files contain 8-bit characters only (that is, the PHP engine reads them that way, right? They are only supposed to contain 8-bit characters). But 8-bit characters in which encoding? I presume they are intended to match ISO-8859-1 (or -x?); can somebody quote chapter and verse? That is, an umlaut is intended to be an umlaut, right? Following this, one can write PHP scripts with HTML and strings for most European nations/character sets straightforwardly.
But it is clear this is problematic with Unicode. As far as I can tell, most PHP applications deal with Unicode essentially by having strings containing UTF-8 byte sequences which can be inserted in 8-bit PHP strings. Following this, one can generate scripts whose HTML contains Unicode UTF-8 sequences, if you tell your server you are generating UTF-8 text.
For the above situations, one can read the PHP file as 8-bit character text, and this seems to me to match the language.
What puzzles me are PHP source files encoded as UTF-8 (the Joomla package has ~1800 source files, of which some 10 are UTF-8 and the rest are not). Any non-ASCII European characters that show correctly in a UTF-8 rendering are actually encoded as multibyte sequences. I suppose such pages served as UTF-8 will have the HTML rendered correctly. But string comparisons involving European or other Unicode characters that apparently render correctly in a text editor simply won't work if the other operand holds the single-byte ISO-8859-1 form. And string literals will not contain what they appear to contain: a literal that shows one character holds two or more bytes. Do programmers use UTF-8 files because that's what editors offer? Are they doing this on purpose? Or is it just an accident that doesn't matter for most work?
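To make the mismatch concrete, here is a minimal sketch in Python (chosen only for illustration; it is not how the PHP engine works internally). The word "für" is my own example, not taken from Joomla. It shows the bytes that the same visible literal occupies in the two encodings, and what a reader that assumes ISO-8859-1 gets from the UTF-8 form.

```python
# The same visible literal "für", written with an escape so this file stays ASCII.
literal = "f\u00fcr"

# Bytes as an ISO-8859-1 editor would store them: one byte for the umlaut.
latin1_bytes = literal.encode("iso-8859-1")   # b'f\xfcr'

# Bytes as a UTF-8 editor would store them: two bytes for the umlaut.
utf8_bytes = literal.encode("utf-8")          # b'f\xc3\xbcr'

# A byte-oriented comparison sees different lengths and unequal strings.
print(len(latin1_bytes), len(utf8_bytes))     # 3 4
print(latin1_bytes == utf8_bytes)             # False

# Reading the UTF-8 bytes as if they were ISO-8859-1 turns the umlaut
# into two unrelated characters (U+00C3 followed by U+00BC).
misread = utf8_bytes.decode("iso-8859-1")
print(len(misread))                           # 4
```

So a tool that wants the exact bytes of a literal has to know which of the two forms is in the file before it can say what the literal "contains".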
So, how should one read a PHP source file? (In particular, in what character encoding?) One possible answer is: always as ISO-8859-1 8-bit codes, regardless of the actual content or BOMs (I see a lot of UTF-8 BOM-marked PHP files). Another answer is: as UTF-8, if so marked.
[Our tools read and write arbitrary encodings. A "trivial" tool is read-file-in-one-character-encoding, write identical code points in another encoding. Reading UTF-8 PHP files that way gets us into trouble when writing ISO-8859-1 equivalent files, because many Unicode code points (e.g., the euro symbol, which exists in ISO-8859-15 but not in ISO-8859-1) cannot be encoded in the target character set.]
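The failure mode of such a transcoder can be sketched in a few lines of Python. The `transcode` helper is hypothetical and stands in for our tool; the codec names are the standard Python ones.

```python
def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Decode bytes from `src`, then re-encode the same code points in `dst`.

    Raises UnicodeEncodeError if a code point has no representation in `dst`.
    """
    return data.decode(src).encode(dst)


# The euro sign (U+20AC) as it appears in a UTF-8 file.
euro_utf8 = "\u20ac".encode("utf-8")          # b'\xe2\x82\xac'

# ISO-8859-1 has no euro sign, so the transcode cannot succeed.
try:
    transcode(euro_utf8, "utf-8", "iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print(latin1_ok)                              # False

# ISO-8859-15 does have it, at byte 0xA4.
print(transcode(euro_utf8, "utf-8", "iso-8859-15"))   # b'\xa4'
```

Whether to fail, substitute, or escape at that point is a policy decision for the tool; the sketch only shows that a lossless rewrite is not always possible.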
EDIT Aug 30: We now check PHP files to see if they have UTF-8 BOMs, or contain non-ASCII bytes that all form legal UTF-8 sequences. In either of these cases, we read the file as UTF-8; otherwise we read it as ISO-8859-1 by default. We now preserve the file's encoding if we modify it. (Getting all this right is quite a lot of work.) This seems to be a safe strategy, but it may be different from what PHP programmers are expecting.
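For anyone wanting to try the same heuristic, here is a rough Python sketch of it. This is not our actual implementation; the function names are made up for this example, and it works on in-memory bytes to keep it short. One assumption on my part: pure-ASCII files are reported as ISO-8859-1, which is harmless because ASCII bytes are identical in both encodings.

```python
import codecs


def detect_php_source_encoding(data: bytes) -> str:
    """Guess the encoding: BOM first, then UTF-8 validity, else ISO-8859-1."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"            # UTF-8 with a BOM to strip and restore
    if not any(b >= 0x80 for b in data):
        return "iso-8859-1"           # pure ASCII: fall back to the default
    try:
        data.decode("utf-8")          # strict: fails on any illegal sequence
    except UnicodeDecodeError:
        return "iso-8859-1"
    return "utf-8"


def decode_source(data: bytes) -> tuple[str, str]:
    """Return the decoded text together with the encoding that was used."""
    encoding = detect_php_source_encoding(data)
    return data.decode(encoding), encoding


def encode_source(text: str, encoding: str) -> bytes:
    """Write text back in the encoding it was read with (BOM included)."""
    return text.encode(encoding)


# Round trip: an unmodified file must come back byte for byte.
original = codecs.BOM_UTF8 + b"<?php echo 'f\xc3\xbcr';"
text, enc = decode_source(original)
print(enc)                                    # utf-8-sig
print(encode_source(text, enc) == original)   # True
```

The known weakness is a short ISO-8859-1 file whose high bytes happen to form legal UTF-8 sequences; that is rare in real text, but it means the result is a guess, not a fact about the file.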