I received a string with an unknown character encoding via import. How can I display such a string in the browser so that it can be reproduced as PHP code?

I would like to illustrate the problem with an example.

$stringUTF8 = "The price is 15 €";
$stringWin1252 = mb_convert_encoding($stringUTF8,'CP1252');

var_dump($stringWin1252);    //string(17) "The price is 15 �"
var_export($stringWin1252);  // 'The price is 15 �'

The string produced by var_export does not match the original. All unrecognized characters are replaced by the � symbol. The string is generated here with mb_convert_encoding for test purposes only, so the character encoding is known. In practice, it comes from imports, e.g. with file_get_contents(), and the character encoding is unknown.

The output with an improved var_export that I expect looks like this:

"The price is 15 \x80"

My approach to the solution is to find all non-UTF8 characters and then show them in hexadecimal. The code for this is too extensive to be shown here.

Another variant is to output all characters in hexadecimal PHP notation.

function strToHex2($str) {
    return '\x'.rtrim(chunk_split(strtoupper(bin2hex($str)),2,'\x'),'\x');
}
echo strToHex2($stringWin1252);

Output:

\x54\x68\x65\x20\x70\x72\x69\x63\x65\x20\x69\x73\x20\x31\x35\x20\x80

This variant is well suited for purely binary data, but quite large and difficult to read for general texts.

My question in other words:

How can I change all non-UTF8 characters in a string to the PHP hex representation "\xnn" and leave correct UTF8 characters unchanged?

jspit
  • _"All unrecognized characters are replaced by the � symbol."_ - that is not PHP's doing, that is simply how the browser you are viewing this in displays a byte sequence that is not valid UTF-8. I can't tell what you actually want to achieve here. What do you gain by having this shown as `\x80` somewhere? Assuming you somehow want to work with this data later - would it then not make much more sense to convert it _into_ UTF-8, before you proceed processing it further ...? – CBroe Sep 24 '21 at 08:36
  • If you convert and output (even for practice), set the proper character encoding in the header: `header('Content-Type: text/html; charset=CP1252');` Then the output is correct. – Michel Sep 24 '21 at 08:38
  • In the general case, I don't know the character encoding, or the string contains binary data that has no character encoding. So it makes no sense to try to convert this to UTF8 or to set a corresponding header. I can simply copy a representation like "The price is 15 \x80" and then use it as code, $string = "The price is 15 \x80", for example here on Stack Overflow in connection with a question, or in an online PHP tool. – jspit Sep 24 '21 at 09:15
  • @jspit. Your question was about `var_dump` and `var_export`, which output the correct characters after conversion if you set the right character encoding. If you want to know how to change all multibyte characters to hex, change the question or ask a new one. – Michel Sep 24 '21 at 09:35

1 Answer

I'm going to start with the question itself:

How can I reproducibly represent a non-UTF8 string in PHP (Browser)

The answer is very simple: just send the correct encoding in an HTML tag or HTTP header.

But that wasn't really your question. I'm actually not 100% sure what the true question is, but I'm going to try to follow what you wrote.

I received a string with an unknown character encoding via import.

That's really where we need to start. If you have an unknown string, then you really just have binary data. If you can't determine what those bytes represent, I wouldn't expect the browser or anyone else to figure it out either. If you can, however, determine what those bytes represent, then once again, send the correct encoding to the client.

How can I display such a string in the browser so that it can be reproduced as PHP code?

You are round-tripping here, which is asking for problems. The only safe and sane answer is Unicode with one of the officially supported encodings such as UTF-8, UTF-16, etc.

The string delivered with var_export does not match the original. All unrecognized characters are replaced by the � symbol.

The string you entered as a sample did not end with a byte sequence of x80. Instead, you entered the character € which is 20AC in Unicode and expressed as the three bytes xE2 x82 xAC in UTF-8. The function mb_convert_encoding maps that character to the single byte x80 in the CP1252 codepage, which is why var_dump reports string(17). A lone x80 byte is not a valid UTF-8 sequence, so whatever decodes the output as UTF-8 shows the Unicode FFFD replacement character in its place.

The string is only generated here with mb_convert_encoding for test purposes.

Even if this is just for testing purposes, it is still messing with the data, and the previous paragraph is important to understand.

Here the character coding is known. In practice, it comes from imports e.g. with file_get_contents() and the character coding is unknown.

We're back to arbitrary bytes at this point. You can either have PHP guess, or if you have a corpus of known data you could build some heuristics.
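As a rough illustration of the "guess" route, the sketch below checks whether the bytes are well-formed UTF-8 and, if not, converts them from a fallback encoding. The helper name `toUtf8` and the choice of Windows-1252 as the fallback are assumptions made for this example; PHP cannot know the real source encoding, so the fallback is only right if your data actually uses it.

```php
<?php
// Hypothetical helper: returns the input unchanged when it is already
// well-formed UTF-8, otherwise converts it from an ASSUMED fallback encoding.
function toUtf8(string $bytes, string $fallback = 'Windows-1252'): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }
    // The fallback is a guess about the source, not a detected fact.
    return mb_convert_encoding($bytes, 'UTF-8', $fallback);
}

$imported = "The price is 15 \x80";                 // CP1252 bytes from the question
var_dump(mb_check_encoding($imported, 'UTF-8'));    // bool(false)
echo bin2hex(toUtf8($imported)), "\n";              // ends in e282ac, the UTF-8 euro sign
```

A string that is valid in several encodings at once (plain ASCII, for instance) passes the check untouched, which is usually what you want.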

The output with an improved var_export that I expect looks like this: "The price is 15 \x80"

Both var_dump and var_export are intended to show you quite literally what is inside the variable, and changing them would cause a giant backward-compatibility (BC) problem. (There actually was an RFC for making a new dumping function but I don't think it did what you want.)

In PHP, strings are just byte arrays so calling these functions dumps those byte arrays to the stream, and your browser or console or whatever takes the current encoding and tries to match those bytes to the current font. If your font doesn't support it, one of the replacement characters is shown. (Or, sometimes a device tries to guess what those bytes represent which is why you see € or similar.) To say that again, your browser/console does this, PHP is not doing that.
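To make the "byte arrays" point concrete, here is a small sketch that looks at the euro sign in both encodings without relying on how a browser or console renders it. The variable names are only illustrative; the byte values are the ones discussed above.

```php
<?php
$euroUtf8   = "\xE2\x82\xAC"; // U+20AC encoded as UTF-8 (three bytes)
$euroCp1252 = "\x80";         // the same character in CP1252 (one byte)

echo strlen($euroUtf8), "\n";             // 3, strlen counts bytes
echo mb_strlen($euroUtf8, 'UTF-8'), "\n"; // 1, one character when decoded as UTF-8
echo bin2hex($euroUtf8), "\n";            // e282ac
echo strlen($euroCp1252), "\n";           // 1
echo bin2hex($euroCp1252), "\n";          // 80
```

bin2hex is handy for this kind of debugging because its output is plain ASCII and looks the same no matter which encoding the viewer assumes.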

My approach to the solution is to find all non-UTF8 characters

That's probably not what you want. First, it assumes that the characters are UTF-8, which you said was not an assumption that you can make. Second, if a file actually has byte sequences that aren't valid UTF-8, you probably have a broken file.

How can I change all non-UTF8 characters in a string to the PHP hex representation "\xnn" and leave correct UTF8 characters unchanged?

The real solution is to use Unicode all the way through your application and to enforce an encoding whenever you store/output something. This also means that when viewing this data, you need a font capable of showing those code points.

When you ingest data, you need to get it to this sane point first, and that's not always easy. Once you are Unicode, however, you should (mostly) be safe. (For "mostly", I'm looking at you Emojis!)

But how do you convert? That's the hard part. This answer shows how to manually convert CP1252 to UTF-8. Basically, repeat with each code point that you want to support.
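The linked answer is not reproduced here, but a manual conversion could look roughly like the sketch below. The function name `cp1252ToUtf8` is hypothetical, and the lookup table is deliberately partial: it covers only four of the CP1252-specific bytes in the x80 to x9F range, and anything else in that range becomes `?`. Bytes from xA0 upward share their code points with ISO-8859-1, so they can be turned into two-byte UTF-8 arithmetically.

```php
<?php
// Hypothetical helper: converts CP1252 bytes to UTF-8 by hand.
function cp1252ToUtf8(string $bytes): string
{
    // PARTIAL table for the CP1252-specific range 0x80-0x9F; extend as needed.
    $special = [
        "\x80" => "\u{20AC}", // euro sign
        "\x85" => "\u{2026}", // horizontal ellipsis
        "\x93" => "\u{201C}", // left double quotation mark
        "\x94" => "\u{201D}", // right double quotation mark
    ];

    $out = '';
    foreach (str_split($bytes) as $byte) {
        $code = ord($byte);
        if ($code < 0x80) {
            $out .= $byte;                  // ASCII is identical in both encodings
        } elseif (isset($special[$byte])) {
            $out .= $special[$byte];        // table lookup for 0x80-0x9F
        } elseif ($code >= 0xA0) {
            // Same code point as ISO-8859-1: encode as a two-byte UTF-8 sequence.
            $out .= chr(0xC0 | ($code >> 6)) . chr(0x80 | ($code & 0x3F));
        } else {
            $out .= '?';                    // byte not covered by the partial table
        }
    }
    return $out;
}

echo bin2hex(cp1252ToUtf8("15 \x80")), "\n"; // 313520e282ac
```

In practice `mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252')` already does this mapping for you; the manual version mainly shows what "repeat with each code point" means.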

If you don't want to do that, and you really want to have the escape sequences, then I think I'd inspect the string byte by byte, and anything over x7F gets escaped:

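$s = "The price is 15 \x80";
$buf = '';
// Walk the string one byte at a time.
foreach (str_split($s) as $c) {
    // Bytes above x7F become a literal \xNN escape; ASCII is copied unchanged.
    $buf .= ord($c) > 0x7F ? '\x' . bin2hex($c) : $c;
}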

var_dump($buf);

// string(20) "The price is 15 \x80"
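
If escaping every byte above x7F is too broad, a variant can leave well-formed UTF-8 sequences alone and escape only the bytes that do not fit into one. The following is a minimal sketch under a few assumptions: `escapeInvalidUtf8` is a hypothetical name, the pattern follows the well-formed byte ranges from RFC 3629, and the function does not escape backslashes, double quotes or `$`, which a complete PHP string literal would also need.

```php
<?php
// Hypothetical helper: keeps well-formed UTF-8 sequences and replaces every
// other byte with a literal \xNN escape.
function escapeInvalidUtf8(string $s): string
{
    // Each alternative is one well-formed UTF-8 sequence; the final (.) catches
    // any single byte that did not start a well-formed sequence.
    $pattern = '/
          [\x00-\x7F]
        | [\xC2-\xDF][\x80-\xBF]
        | \xE0[\xA0-\xBF][\x80-\xBF]
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
        | \xED[\x80-\x9F][\x80-\xBF]
        | \xF0[\x90-\xBF][\x80-\xBF]{2}
        | [\xF1-\xF3][\x80-\xBF]{3}
        | \xF4[\x80-\x8F][\x80-\xBF]{2}
        | (.)
    /xs'; // no "u" modifier on purpose: the pattern must see raw bytes

    return preg_replace_callback(
        $pattern,
        function (array $m): string {
            // Group 1 is only present when the catch-all alternative matched.
            return isset($m[1]) ? sprintf('\x%02X', ord($m[1])) : $m[0];
        },
        $s
    );
}

var_dump(escapeInvalidUtf8("The price is 15 \x80"));
// string(20) "The price is 15 \x80"
```

With this variant a string such as "law §15" stored as UTF-8 passes through unchanged, because xC2 xA7 is a well-formed two-byte sequence.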
Chris Haas
  • Thank you for the effort. Escaping anything above x7F would mean doing the same with many multibyte Unicode characters. Example: "law §15" would become "law \xc2\xa715". However, I only want to escape characters that are not displayed in the browser or are displayed with the substitute character �. – jspit Sep 27 '21 at 07:23
  • Yeah, that's kind of the whole point of why I wrote all of that. Pretty much _all_ characters can be displayed in the browser, _as long as_ the appropriate font is available. PHP, via HTTP, will send byte sequences, the browser will interpret according to the encoding, and then do lookups based on available font data, falling back to one of the replacement characters. As long as you are Unicode through your pipeline, and are debugging with a font with enough characters, there shouldn't be a problem. _Getting_ to Unicode can be a pain, however. – Chris Haas Sep 27 '21 at 13:38