I'm going to start with the question itself:
How can I reproducibly represent a non-UTF8 string in PHP (Browser)
The answer is very simple, just send the correct encoding in an HTML tag or HTTP header.
But that wasn't really your question. I'm actually not 100% sure what the true question is, but I'm going to try to follow what you wrote.
I received a string with an unknown character encoding via import.
That's really where we need to start. If you have an unknown string, then you really just have binary data. If you can't determine what those bytes represents, I wouldn't expect the browser or anyone else to figure it out either. If you can, however, determine what those bytes represent, then once again, send the correct encoding to the client.
How can I display such a string in the browser so that it can be reproduced
as PHP code?
You are round-tripping here which is asking for problems. The only safe and sane answer is Unicode with one of the officially support encodings such as UTF-8, UTF-16, etc.
The string delivered with var_export
does not match the original. All unrecognized characters are replaced by the � symbol.
The string you entered as a sample did not end with a byte sequence of x80
. Instead, you entered the €
character which is 20AC
in Unicode and expressed as the three bytes xE2 x82 xAC
in UTF-8. The function mb_convert_encoding
doesn't have a map of all logical characters in every encoding, and so for this specific case it doesn't know how to map "Euro Sign" to the CP1252 codepage. Whenever a character conversion fails, the Unicode FFFD
character is used instead.
The string is only generated here with mb_convert_encoding for test purposes.
Even if this is just for testing purposes, it is still messing with the data, and the previous paragraph is important to understand.
Here the character coding is known. In practice, it comes from imports e.g. with file_get_contents()
and the character coding is unknown.
We're back to arbitrary bytes at this point. You can either have PHP guess, or if you have a corpus of known data you could build some heuristics.
The output with an improved var_export that I expect looks like this:
"The price is 15 \x80"
Both var_dump
and var_export
are intended to show you quite literally what is inside the variable, and changing them would have a giant BC problem. (There actually was an RFC for making a new dumping function but I don't think it did what you want.)
In PHP, strings are just byte arrays so calling these functions dumps those byte arrays to the stream, and your browser or console or whatever takes the current encoding and tries to match those bytes to the current font. If your font doesn't support it, one of the replacement characters is shown. (Or, sometimes a device tries to guess what those bytes represent which is why you see €
or similar.) To say that again, your browser/console does this, PHP is not doing that.
My approach to the solution is to find all non-UTF8 characters
That's probably not what you want. First, it assumes that the characters are UTF-8, which you said was not an assumption that you can make. Second, if a file actually has byte sequences that aren't valid UTF-8, you probably have a broken file.
How can I change all non-UTF8 characters from a string to the PHP hex representation "\xnn" and leave correct UTF8 characters.
The real solution is to use Unicode all the way through your application and to enforce an encoding whenever you store/output something. This also means that when viewing this data that you have a font capable of showing those code points.
When you ingest data, you need to get it to this sane point first, and that's not always easy. Once you are Unicode, however, you should (mostly) be safe. (For "mostly", I'm looking at you Emojis!)
But how do you convert? That's the hard part. This answer shows how to manually convert CP1252 to UTF-8. Basically, repeat with each code point that you want to support.
If you don't want to do that, and you really want to have the escape sequences, then I think I'd inspect the string byte by byte, and anything over x7F
gets escaped:
$s = "The price is 15 \x80";
$buf = '';
foreach(str_split($s) as $c){
$buf .= $c >= "\x80" ? '\x' . bin2hex($c) : $c;
}
var_dump($buf);
// string(20) "The price is 15 \x80"