When PHP outputs file names from an FTP folder, accented French characters come out 3 bytes long. For example, when we var_dump one:

var_dump("é");

It shows:

string(3)

But the character should be 2 bytes long:

string(2)

The file names are pulled using a WordPress function.

When it's string(3), preg_match doesn't match it, so we can't replace it with a standard ASCII character.

I tried declaring the encoding as UTF-8, but it's already UTF-8. I also tried:

header('Content-Type: text/html; charset=iso-8859-1');

But the result is garbled text.

Is there anything else we can try? What kind of a character is it?

Robert Sinclair
  • This sounds very similar to http://stackoverflow.com/a/1725329/7496329 – Andy Feb 01 '17 at 21:43
  • Make sure you wrap the filename in the echo with htmlspecialchars - http://php.net/manual/en/function.htmlspecialchars.php – Brogan Feb 01 '17 at 21:48
  • `string(1)` is a 1-byte string; UTF-8 is a __multi-byte__ character set, so 1 character does not equate to 1 byte – Mark Baker Feb 01 '17 at 21:50
  • Possible duplicate of [preg\_match and UTF-8 in PHP](http://stackoverflow.com/questions/1725227/preg-match-and-utf-8-in-php) – miken32 Feb 01 '17 at 21:52
  • @MarkBaker You're right it should be string(2), but for some reason I'm getting string(3) which doesn't work with preg_match, I'm reading the other suggested replies to see if I can use them. Thank you for the comments everyone – Robert Sinclair Feb 01 '17 at 21:59
  • It doesn't look as though you have a UTF-8 `é`: [take a look at the differences between your character and a UTF-8 `é`](https://3v4l.org/FGSrb) – Mark Baker Feb 01 '17 at 22:23
  • @MarkBaker wow thank you, i'm definitely keeping this function for future use. Do you know if there's an automatic way to convert these é to their UTF-8 equivalents? Strange, I was using var_dump(mb_detect_encoding("é")) to check the encoding and it was showing UTF-8. – Robert Sinclair Feb 01 '17 at 22:27
  • This is a Unicode "Combining sequence": `65` is `e` "LATIN SMALL LETTER E" (U+0065) and `cc81` is `́` "COMBINING ACUTE ACCENT (U+0301)" – Mark Baker Feb 01 '17 at 22:29
  • thank you, do you know if there's a way to automatically convert 65cc81 to c3a9 and then back into a single character string. This probably sounds dumb, but you get the point. – Robert Sinclair Feb 01 '17 at 22:34
  • Just posted an answer to that – Mark Baker Feb 01 '17 at 22:36

1 Answer

Your character is actually the byte sequence 0x65cc81, rather than the more usual 0xc3a9, which is the UTF-8 encoding of the single Unicode codepoint é LATIN SMALL LETTER E WITH ACUTE (U+00E9). 0x65cc81 is a Unicode "Combining sequence": 0x65 is e "LATIN SMALL LETTER E" (U+0065) and 0xcc81 is ́ "COMBINING ACUTE ACCENT (U+0301)". Both forms are valid UTF-8, which is why mb_detect_encoding still reports UTF-8.
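To make the difference between the two forms concrete, here is a minimal sketch that builds each one from explicit code points and compares their bytes. The variable names are illustrative only, and the `"\u{...}"` escape assumes PHP 7.0 or later.

```php
<?php
// Build both forms from explicit code points so that editor or clipboard
// normalization cannot silently change what is being compared.
$decomposed  = "e\u{0301}"; // U+0065 followed by U+0301 (combining sequence)
$precomposed = "\u{00E9}";  // U+00E9 (single code point)

var_dump(bin2hex($decomposed));   // string(6) "65cc81"
var_dump(bin2hex($precomposed));  // string(4) "c3a9"

// strlen() counts bytes, which is what var_dump reports as string(N).
var_dump(strlen($decomposed));    // int(3)
var_dump(strlen($precomposed));   // int(2)

// With the /u flag, "." matches one code point at a time.
var_dump(preg_match_all('/./u', $decomposed));  // int(2)
var_dump(preg_match_all('/./u', $precomposed)); // int(1)

// The two strings render the same but are not byte-equal.
var_dump($decomposed === $precomposed); // bool(false)
```

Because the strings differ at the byte level, a pattern written with the precomposed é will not match a file name that contains the decomposed one.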

You can convert from the combining sequence to the single codepoint using PHP's Normalizer (part of the intl extension):

// Return the bytes of a string as a hexadecimal string
function strhex($string) {
  $hexstr = unpack('H*', $string);
  return array_shift($hexstr);
}

// Decomposed form: "e" followed by U+0301 (3 bytes)
$character = "é";
var_dump($character);
var_dump(strhex($character));

// Compose into the single codepoint U+00E9 (NFC is the default form)
$character = Normalizer::normalize($character);

var_dump($character);
var_dump(strhex($character));

gives

string(3) "é"
string(6) "65cc81"
string(2) "é"
string(4) "c3a9"
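
If the end goal is plain ASCII file names, as in the question, one possible alternative is to go the other way: keep (or produce) the decomposed form and strip the combining marks. The sketch below is an assumption about what would be useful here, not part of the approach above. `stripAccents` is a hypothetical helper name, and it only handles letters that decompose into a base letter plus marks (so not, for example, `ø` or `ß`).

```php
<?php
// Hypothetical helper: reduce accented letters to their base letters.
function stripAccents(string $name): string
{
    // When the intl extension is available, decompose first (NFD) so that
    // precomposed characters such as U+00E9 are handled as well.
    if (class_exists('Normalizer')) {
        $decomposed = Normalizer::normalize($name, Normalizer::FORM_D);
        if ($decomposed !== false) {
            $name = $decomposed;
        }
    }

    // \p{Mn} matches nonspacing marks such as U+0301; the /u flag makes
    // PCRE treat the pattern and subject as UTF-8 rather than raw bytes.
    $stripped = preg_replace('/\p{Mn}+/u', '', $name);

    // preg_replace() returns null on error (e.g. invalid UTF-8 input);
    // fall back to the unmodified name in that case.
    return $stripped ?? $name;
}

// Decomposed input, like the file names in the question.
$fileName = "re\u{0301}sume\u{0301}.pdf";
var_dump(stripAccents($fileName)); // string(10) "resume.pdf"
```

Without the intl extension the helper still works on already-decomposed names like the ones in the question, but precomposed characters pass through unchanged.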
Mark Baker
  • This works beautifully, thank you for your time Mark! and we can run this on whole strings that contain special characters not just single characters – Robert Sinclair Feb 01 '17 at 22:40
  • For those reading this just use Mark's $character = Normalizer::normalize($character); and it will convert them to normal single code point – Robert Sinclair Feb 01 '17 at 23:07