I am having issues with character encoding. I have simplified the problem to the script below:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
<?php

$string = 'Stan&#146;s';

echo $string.'<br><br>'; // Stan's

echo html_entity_decode($string).'<br><br>'; // Stan's

echo html_entity_decode($string, ENT_QUOTES, 'UTF-8'); // Stans

?>
</body>
</html>

I would like to use the last echo, but it removes the '. Why?

Update

I have tried all three options (ENT_COMPAT, ENT_QUOTES, ENT_NOQUOTES), and the ' is removed in every case.

Abs

1 Answer

The problem is that &#146; decodes to the Unicode character U+0092 (UTF-8 bytes C2 92), a C1 control character known as PRIVATE USE TWO:

$ php test.php | xxd
0000000: 5374 616e c292 73                        Stan..s

That is, it does not decode to an ordinary apostrophe.
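
To make the difference visible at the byte level, here is a minimal sketch that encodes code point U+0092 directly and compares it with byte 0x92 read as Windows-1252. It assumes the mbstring and iconv extensions are loaded and PHP 7.2 or later (for mb_chr); the variable names are only illustrative.

```php
<?php
// U+0092 encoded as UTF-8: a C1 control character with no visible glyph.
$control = mb_chr(0x92, 'UTF-8');
echo bin2hex($control), "\n"; // c292

// Byte 0x92 read as Windows-1252: U+2019, the right single quotation mark.
$quote = iconv('CP1252', 'UTF-8', "\x92");
echo bin2hex($quote), "\n"; // e28099
```

The first value matches the C2 92 bytes in the xxd dump; the second is the curly apostrophe the author of the HTML presumably intended.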

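html_entity_decode($string) appears to work because it doesn't actually decode the entity: the default target charset is Latin-1 (this was the default before PHP 5.4; later versions default to UTF-8), which cannot represent this character, so the entity is passed through untouched and the browser renders it. If you specify UTF-8 as the target charset, the entity is actually decoded.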

The character this entity is meant to represent comes from the Windows-1252 charset:

echo iconv('cp1252', 'UTF-8', html_entity_decode('Stan&#146;s', ENT_QUOTES, 'cp1252'));

Stan’s

Quoting Wikipedia:

Numeric references always refer to Unicode code points, regardless of the page's encoding. Using numeric references that refer to permanently undefined characters and control characters is forbidden, with the exception of the linefeed, tab, and carriage return characters. That is, characters in the hexadecimal ranges 00–08, 0B–0C, 0E–1F, 7F, and 80–9F cannot be used in an HTML document, not even by reference, so &#153;, for example, is not allowed. However, for backward compatibility with early HTML authors and browsers that ignored this restriction, raw characters and numeric character references in the 80–9F range are interpreted by some browsers as representing the characters mapped to bytes 80–9F in the Windows-1252 encoding.

So you're dealing with legacy HTML entities here, which PHP apparently doesn't handle the same way "some" browsers do. You may want to check whether a numeric entity falls in the range specified above and, if so, decode it as Windows-1252 and then convert the result to UTF-8. Alternatively, require your users to pass valid HTML.

This function should handle both legacy and regular HTML entities:

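function legacy_html_entity_decode($str, $quotes = ENT_QUOTES, $charset = 'UTF-8') {
    // Only decimal numeric references (&#NNN;) are matched; $m[1] holds the digits.
    return preg_replace_callback('/&#(\d+);/', function ($m) use ($quotes, $charset) {
        // 0x80-0x9F: legacy reference, decode as Windows-1252, then convert to the target charset.
        if (0x80 <= (int) $m[1] && (int) $m[1] <= 0x9F) {
            return iconv('cp1252', $charset, html_entity_decode($m[0], $quotes, 'cp1252'));
        }
        return html_entity_decode($m[0], $quotes, $charset);
    }, $str);
}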
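
As a variation on the same idea, the sketch below maps references in the 80–9F range straight to the corresponding Windows-1252 byte and leaves everything else (other numeric references and named entities) to a single html_entity_decode call. The function name decode_with_cp1252_fallback is hypothetical, and the code assumes PHP 7 or later with the iconv extension and UTF-8 as the output charset.

```php
<?php
// Hypothetical helper, not part of PHP: resolve legacy 0x80-0x9F references
// as Windows-1252 bytes, then decode all remaining entities as UTF-8.
function decode_with_cp1252_fallback(string $str, int $flags = ENT_QUOTES): string
{
    // Group 1 captures hex digits (&#x92;), group 2 decimal digits (&#146;).
    $str = preg_replace_callback('/&#(?:x([0-9a-f]+)|(\d+));/i', function (array $m): string {
        $code = ($m[1] !== '') ? hexdec($m[1]) : (int) $m[2];
        if ($code < 0x80 || $code > 0x9F) {
            return $m[0]; // not a legacy reference; html_entity_decode handles it below
        }
        // @ suppresses the notice iconv raises for bytes Windows-1252 leaves undefined.
        $char = @iconv('CP1252', 'UTF-8', chr($code));
        return ($char === false || $char === '') ? $m[0] : $char;
    }, $str);

    return html_entity_decode($str, $flags, 'UTF-8');
}

echo decode_with_cp1252_fallback('Stan&#146;s &amp; Co'), "\n"; // Stan’s & Co
```

This variant does not depend on how html_entity_decode itself treats references in the 80–9F range, which may differ between PHP versions. It also covers hexadecimal references such as &#x92;.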
deceze
  • Interesting, so should I convert the `&#146;` to something else and then apply `html_entity_decode`? – Abs Aug 21 '11 at 11:46
  • +1. I'd like to add that using `'` instead of `&#146;` will give you your desired result. – adlawson Aug 21 '11 at 11:55
  • @adlawson - unfortunately I don't have control of the input. @deceze - I just tried `utf8_encode(html_entity_decode($string))` and surprisingly it works! – Abs Aug 21 '11 at 11:58
  • Maybe you can add this to your answer, as the remark about `html_entity_decode($string)` working prompted me to try wrapping a utf8_encode around it. – Abs Aug 21 '11 at 11:59
  • @Abs Yeah, but in HTML source you should see `&#146;`. – Karolis Aug 21 '11 at 12:01
  • @Karolis - yes, in the HTML source I see `&#146;`. @deceze - that looks like a specific solution to this problem, but I should have explained that users will be passing in HTML from different operating systems and I have to handle the encoding. – Abs Aug 21 '11 at 12:05
  • @deceze - interesting, I didn't think encoding was going to be this complex! – Abs Aug 21 '11 at 12:16
  • @Abs It really wouldn't be, if everybody would follow the same rules and do it right. :o) – deceze Aug 21 '11 at 12:17
  • @deceze - that's something we can only dream of! :) – Abs Aug 21 '11 at 12:31
  • @Abs See [this comment](http://www.php.net/manual/en/function.html-entity-decode.php#76408) on the PHP manual, which seems to list a translation table for legacy entities (haven't checked too closely for accuracy). You could apply a `preg_replace_callback` on all HTML entities, translating them if necessary before decoding. – deceze Aug 21 '11 at 12:31
  • That's a perfect and concise function Deceze – georgiecasey Apr 28 '12 at 21:19