PHP outputs a strange french character | é = string(3)

Question

When PHP outputs file names from an FTP folder, accented French characters come out 3 bytes long, so when we var_dump:

var_dump("é");

It shows:

string(3)

But the character should actually be

string(2)

The file names are pulled using a WordPress function.

When it's string(3) we can't do a preg_match on it to replace it with a standard ASCII character.

I tried declaring the encoding as UTF-8, but it's already UTF-8. I also tried:

header('Content-Type: text/html; charset=iso-8859-1');

But the result is garbled text.

Is there anything else we can try? What kind of a character is it?

This sounds very similar to http://stackoverflow.com/a/1725329/7496329 — Andy, Feb 01 '17 at 21:43
Make sure you wrap the filename in the echo with htmlspecialchars - http://php.net/manual/en/function.htmlspecialchars.php — Brogan, Feb 01 '17 at 21:48
`string(1)` is a 1-byte string; UTF-8 is a __multi-byte__ character set, so 1 character does not equate to 1 byte — Mark Baker, Feb 01 '17 at 21:50
Possible duplicate of [preg\_match and UTF-8 in PHP](http://stackoverflow.com/questions/1725227/preg-match-and-utf-8-in-php) — miken32, Feb 01 '17 at 21:52
@MarkBaker You're right it should be string(2), but for some reason I'm getting string(3) which doesn't work with preg_match, I'm reading the other suggested replies to see if I can use them. Thank you for the comments everyone — Robert Sinclair, Feb 01 '17 at 21:59
It doesn't look as though you have a UTF-8 `é`: [take a look at the differences between your character and a UTF-8 `é`](https://3v4l.org/FGSrb) — Mark Baker, Feb 01 '17 at 22:23
@MarkBaker wow thank you, i'm definitely keeping this function for future use. Do you know if there's an automatic way to convert these é to their UTF-8 equivalents? Strange, I was using var_dump(mb_detect_encoding("é")) to check the encoding and it was showing UTF-8. — Robert Sinclair, Feb 01 '17 at 22:27
This is a Unicode "Combining sequence": `65` is `e` "LATIN SMALL LETTER E" (U+0065) and `cc81` is `́` "COMBINING ACUTE ACCENT (U+0301)" — Mark Baker, Feb 01 '17 at 22:29
thank you, do you know if there's a way to automatically convert 65cc81 to c3a9 and then back into a single character string. This probably sounds dumb, but you get the point. — Robert Sinclair, Feb 01 '17 at 22:34

score 3 · Accepted Answer · answered Feb 01 '17 at 22:36

Your character é is actually 0x65cc81, rather than the more usual single Unicode codepoint in UTF-8 0xc3a9 (é LATIN SMALL LETTER E WITH ACUTE (U+00E9)). 0x65cc81 is a Unicode "Combining sequence": 0x65 is e "LATIN SMALL LETTER E" (U+0065) and 0xcc81 is ́ "COMBINING ACUTE ACCENT (U+0301)".

You can convert from the combining sequence to the single codepoint using PHP's Normalizer class (part of the intl extension), which normalizes to NFC (composed form) by default:

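// Return the raw bytes of a string as a hex string.
function strhex($string) {
  $hexstr = unpack('H*', $string);
  return array_shift($hexstr);
}

// "e" followed by COMBINING ACUTE ACCENT (U+0301), written as raw bytes
// so that copy/paste cannot silently turn it into the composed form.
$character = "e\xcc\x81";
var_dump($character);
var_dump(strhex($character));

// Compose the sequence into the single codepoint U+00E9 (NFC is the default form).
$character = Normalizer::normalize($character);

var_dump($character);
var_dump(strhex($character));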

gives

string(3) "é"
string(6) "65cc81"
string(2) "é"
string(4) "c3a9"
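The same approach extends to whole file names and to the original goal of replacing the accented character with plain ASCII. What follows is only a minimal sketch under a few assumptions: the intl extension is loaded, the input is valid UTF-8, and `normalize_filename` is a hypothetical helper name that is not part of the original answer. Once the name is in NFC, a `preg_replace` with the `/u` modifier can match the single codepoint U+00E9.

```php
<?php
// Hypothetical helper (not from the original answer): normalize a file name
// to NFC so that accented letters become single codepoints.
// Assumes the intl extension, which provides the Normalizer class.
function normalize_filename(string $name): string
{
    $normalized = Normalizer::normalize($name, Normalizer::FORM_C);
    // normalize() returns false on failure; fall back to the original name.
    return $normalized === false ? $name : $normalized;
}

// "cafe" followed by COMBINING ACUTE ACCENT (U+0301), written as raw bytes.
$decomposed = "cafe\xcc\x81.txt";
$composed = normalize_filename($decomposed);

var_dump(strlen($decomposed)); // int(10)
var_dump(strlen($composed));   // int(9)

// The /u modifier makes PCRE treat pattern and subject as UTF-8,
// so \x{00E9} matches the composed character as a single codepoint.
$ascii = preg_replace('/\x{00E9}/u', 'e', $composed);
var_dump($ascii); // string(8) "cafe.txt"
```

Normalizing once, right after the names are read, keeps every later comparison and regular expression working against a single representation of each character.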

This works beautifully, thank you for your time Mark! and we can run this on whole strings that contain special characters not just single characters — Robert Sinclair, Feb 01 '17 at 22:40
For those reading this just use Mark's $character = Normalizer::normalize($character); and it will convert them to normal single code point — Robert Sinclair, Feb 01 '17 at 23:07
