How can I convert HTML character references (&#x5E3;) to regular UTF-8?

Question

I have some Hebrew websites that contain character references like: &#x5E0;&#x5D5;&#x5E3; (which a browser renders as נוף)

I can only view these letters if I save the file as .html and view in UTF-8 encoding.

If I try to open it as a regular text file then UTF-8 encoding does not show the proper output.

I noticed that if I open a text editor and write Hebrew in UTF-8, each character takes two bytes, not 4 bytes like in this example (&#x5D5;)

Any ideas if this is UTF-16 or any other kind of UTF representation of letters?

How can I convert it to normal letters if possible?

Using latest PHP version.

i want to know how to convert it to regular utf-8 and i wanna know what are these characters? is this the representation of utf-16 or is it something else ? — ufk, Aug 25 '10 at 12:31

score 6 · Accepted Answer · edited May 23 '17 at 12:13

Those are character references that refer to characters in ISO 10646 by specifying the code point of that character in decimal (&#n;) or hexadecimal (&#xn;) notation.
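To make the byte counts from the question concrete, here is a small sketch. It assumes the page contains the reference for ף in either notation; the variable names are only for illustration. The reference itself is ASCII markup, not UTF-16, and only the decoded letter is the two-byte UTF-8 sequence:

```php
<?php
// Both notations name the same code point, U+05E3 (HEBREW LETTER FINAL PE).
$decimal = '&#1507;';  // decimal notation
$hex     = '&#x5E3;';  // hexadecimal notation

$fromDecimal = html_entity_decode($decimal, ENT_NOQUOTES, 'UTF-8');
$fromHex     = html_entity_decode($hex, ENT_NOQUOTES, 'UTF-8');

// The reference is plain ASCII text, one byte per character of markup.
echo strlen($decimal), "\n";          // 7

// Once decoded, the letter is the usual two-byte UTF-8 sequence.
echo strlen($fromDecimal), "\n";      // 2
echo bin2hex($fromDecimal), "\n";     // d7a3

// Decimal and hexadecimal references decode to the same string.
var_dump($fromDecimal === $fromHex);  // bool(true)
```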

You can use html_entity_decode that decodes such character references as well as the entity references for entities defined for HTML 4, so other references like &lt;, &gt;, &amp; will also get decoded:

$str = html_entity_decode($str, ENT_NOQUOTES, 'UTF-8');
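As a quick illustration of that call, the following sketch runs it on a made-up input string (the sample text is an assumption, not taken from the question) that mixes numeric character references with named entity references:

```php
<?php
// Hypothetical sample input: three numeric references plus named references.
$str = '&#x5E0;&#x5D5;&#x5E3; &lt;b&gt; &amp; &copy;';

// Numeric references and HTML 4 entity references are decoded in one pass.
$decoded = html_entity_decode($str, ENT_NOQUOTES, 'UTF-8');

echo $decoded, "\n"; // נוף <b> & ©
```

Note that the markup characters come back as well, so the result should not be echoed into an HTML page without escaping it again.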

If you just want to decode the numeric character references, you can use this:

function html_dereference($match) {
    if (strtolower($match[1][0]) === 'x') {
        $codepoint = intval(substr($match[1], 1), 16);
    } else {
        $codepoint = intval($match[1], 10);
    }
    return mb_convert_encoding(pack('N', $codepoint), 'UTF-8', 'UTF-32BE');
}
$str = preg_replace_callback('/&#(x[0-9a-f]+|[0-9]+);/i', 'html_dereference', $str);
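The conversion step in the callback may look cryptic, so here it is in isolation for a single code point. This is only a sketch of what the `pack`/`mb_convert_encoding` pair does; the `mb_chr` line is an alternative that assumes PHP 7.2 or later with the mbstring extension:

```php
<?php
$codepoint = 0x5E3; // U+05E3, as parsed from "&#x5E3;"

// pack('N', ...) writes the number as an unsigned 32-bit big-endian integer,
// which is exactly one UTF-32BE code unit.
$utf32 = pack('N', $codepoint);
echo bin2hex($utf32), "\n"; // 000005e3

// Converting that code unit to UTF-8 yields the two-byte sequence of the letter.
$utf8 = mb_convert_encoding($utf32, 'UTF-8', 'UTF-32BE');
echo bin2hex($utf8), "\n";  // d7a3

// On PHP 7.2+ mb_chr() performs the same conversion in one call.
$viaMbChr = mb_chr($codepoint, 'UTF-8');
var_dump($viaMbChr === $utf8); // bool(true)
```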

As YuriKolovsky and thirtydot have pointed out in another question, it seems that browser vendors ‘silently’ agreed on a mapping of character references that differs from the specification and is largely undocumented.

There seem to be some character references that would normally be mapped onto the Latin-1 Supplement but that are actually mapped onto different characters. The mapping used is the one that results from interpreting the code points as Windows-1252 rather than ISO 8859-1, on which the Unicode character set is built. Jukka Korpela wrote an extensive article on this topic.
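To see the two interpretations side by side, this sketch takes the code point from a reference such as `&#150;` (chosen here only as an example) and converts the corresponding byte once as ISO 8859-1 and once as Windows-1252. It assumes the mbstring extension knows both encoding names:

```php
<?php
// The reference &#150; names code point 150 (0x96).
$byte = chr(150);

// Read as ISO 8859-1 (and thus as Unicode), 0x96 is the control character U+0096.
$asLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');
echo bin2hex($asLatin1), "\n";      // c296

// Read as Windows-1252, the same value is U+2013 EN DASH.
$asWindows1252 = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');
echo bin2hex($asWindows1252), "\n"; // e28093
```

The second result is the character browsers tend to display for such a reference, which is why a decoder that wants to match them needs an explicit override table.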

Now here’s an extension to the function mentioned above that handles this quirk:

function html_character_reference_decode($string, $encoding='UTF-8', $fixMappingBug=true) {
    $deref = function($match) use ($encoding, $fixMappingBug) {
        if (strtolower($match[1][0]) === "x") {
            $codepoint = intval(substr($match[1], 1), 16);
        } else {
            $codepoint = intval($match[1], 10);
        }
        // @see http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
        if ($fixMappingBug && $codepoint >= 130 && $codepoint <= 159) {
            $mapping = array(
                8218, 402, 8222, 8230, 8224, 8225, 710, 8240, 352, 8249,
                338, 141, 142, 143, 144, 8216, 8217, 8220, 8221, 8226,
                8211, 8212, 732, 8482, 353, 8250, 339, 157, 158, 376);
            $codepoint = $mapping[$codepoint-130];
        }
        return mb_convert_encoding(pack("N", $codepoint), $encoding, "UTF-32BE");
    };
    return preg_replace_callback('/&#(x[0-9a-f]+|[0-9]+);/i', $deref, $string);
}

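If anonymous functions are not available (they were introduced in PHP 5.3.0), you could also use create_function. Note that create_function itself was deprecated in PHP 7.2.0 and removed in PHP 8.0.0, so this variant only makes sense for old installations: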

$deref = create_function('$match', '
    $encoding = '.var_export($encoding, true).';
    $fixMappingBug = '.var_export($fixMappingBug, true).';
    if (strtolower($match[1][0]) === "x") {
        $codepoint = intval(substr($match[1], 1), 16);
    } else {
        $codepoint = intval($match[1], 10);
    }
    // @see http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
    if ($fixMappingBug && $codepoint >= 130 && $codepoint <= 159) {
        $mapping = array(
            8218, 402, 8222, 8230, 8224, 8225, 710, 8240, 352, 8249,
            338, 141, 142, 143, 144, 8216, 8217, 8220, 8221, 8226,
            8211, 8212, 732, 8482, 353, 8250, 339, 157, 158, 376);
        $codepoint = $mapping[$codepoint-130];
    }
    return mb_convert_encoding(pack("N", $codepoint), $encoding, "UTF-32BE");
');

Here’s another function that tries to comply with the behavior of HTML 5:

function html5_decode($string, $flags=ENT_COMPAT, $charset='UTF-8') {
    $deref = function($match) use ($flags, $charset) {
        if ($match[1][0] === '#') {
            if (strtolower($match[1][1]) === 'x') {
                $codepoint = intval(substr($match[1], 2), 16);
            } else {
                $codepoint = intval(substr($match[1], 1), 10);
            }

            // HTML 5 specific behavior
            // @see http://dev.w3.org/html5/spec/tokenization.html#tokenizing-character-references

            // handle Windows-1252 mismapping
            // @see http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
            // @see http://dev.w3.org/html5/spec/tokenization.html#table-charref-overrides
            $overrides = array(
                0x00=>0xFFFD,0x80=>0x20AC,0x82=>0x201A,0x83=>0x0192,0x84=>0x201E,
                0x85=>0x2026,0x86=>0x2020,0x87=>0x2021,0x88=>0x02C6,0x89=>0x2030,
                0x8A=>0x0160,0x8B=>0x2039,0x8C=>0x0152,0x8E=>0x017D,0x91=>0x2018,
                0x92=>0x2019,0x93=>0x201C,0x94=>0x201D,0x95=>0x2022,0x96=>0x2013,
                0x97=>0x2014,0x98=>0x02DC,0x99=>0x2122,0x9A=>0x0161,0x9B=>0x203A,
                0x9C=>0x0153,0x9E=>0x017E,0x9F=>0x0178);
            if (isset($overrides[$codepoint])) {
                $codepoint = $overrides[$codepoint];
            }

            if (($codepoint >= 0xD800 && $codepoint <= 0xDFFF) || $codepoint > 0x10FFFF) {
                $codepoint = 0xFFFD;
            }
            if (($codepoint >= 0x0001 && $codepoint <= 0x0008) ||
                ($codepoint >= 0x000E && $codepoint <= 0x001F) ||
                ($codepoint >= 0x007F && $codepoint <= 0x009F) ||
                ($codepoint >= 0xFDD0 && $codepoint <= 0xFDEF) ||
                in_array($codepoint, array(
                    0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF, 0x2FFFE, 0x2FFFF,
                    0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE, 0x5FFFF, 0x6FFFE,
                    0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF, 0x9FFFE, 0x9FFFF,
                    0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE, 0xCFFFF, 0xDFFFE,
                    0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF, 0x10FFFE, 0x10FFFF))) {
                $codepoint = 0xFFFD;
            }
            return mb_convert_encoding(pack("N", $codepoint), $charset, "UTF-32BE");
        } else {
            return html_entity_decode($match[0], $flags, $charset);
        }   
    };
    return preg_replace_callback('/&(#(?:x[0-9a-f]+|[0-9]+)|[A-Za-z0-9]+);/i', $deref, $string);
}

I’ve also noticed that PHP 5.4.0 added another flag to html_entity_decode, ENT_HTML5, for HTML 5 behavior.
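One difference the document-type flag makes can be seen with `&apos;`, which is not an HTML 4 entity. The sketch below assumes PHP 5.4 or later and uses a made-up input; it does not try to cover every difference between the two modes:

```php
<?php
// Hypothetical input: a named reference unknown to HTML 4 plus a numeric one.
$input = '&apos;&#x5E3;';

// With the HTML 4.01 table, &apos; is left untouched.
$html4 = html_entity_decode($input, ENT_QUOTES | ENT_HTML401, 'UTF-8');
echo $html4, "\n"; // &apos;ף

// With the HTML 5 table, it is decoded to an apostrophe.
$html5 = html_entity_decode($input, ENT_QUOTES | ENT_HTML5, 'UTF-8');
echo $html5, "\n"; // 'ף
```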

Any particular reason for using `mb_convert_encoding` instead of `iconv`? — ircmaxell, Aug 25 '10 at 13:05
You can turn a string represented by a local character set into the one represented by another character set, which may be the Unicode character set. Supported character sets depend on the iconv implementation of your system. — RobertPitt, Aug 25 '10 at 13:23
fair enough. It's not bad, I was just more curious as to the choice... +1 — ircmaxell, Aug 25 '10 at 13:29
what about the microsoft windows character references like ? ;p — Timo Huovinen, Feb 09 '12 at 13:54
Alohci [pointed out to me](http://stackoverflow.com/questions/9210473/convert-0-9-and-xa-fa-f0-9-references-to-utf-8-equvalents#comment11597206_9210614) that "The character override mapping is formally specified in the HTML5 spec here: [http://dev.w3.org/html5/spec/tokenization.html#table-charref-overrides](http://dev.w3.org/html5/spec/tokenization.html#table-charref-overrides)". Does your updated function match that? — thirtydot, Feb 09 '12 at 14:47
@Gumbo works now after I added the `use ($fixMappingBug,$encoding)` fix, thanks! — Timo Huovinen, Feb 09 '12 at 18:02
@Gumbo thank you for keeping this updated, but does it support the full list of html4 references unlike `html_entity_decode`? (I think `&acute;`, `&mdash;`, `&ndash;` were not supported, not sure) — Timo Huovinen, Nov 17 '13 at 10:33

ircmaxell · Answer 2 · 2010-08-25T12:54:44.583

5

Those are XML Character References. You want to decode them using html_entity_decode():

$string = html_entity_decode($string, ENT_QUOTES, 'UTF-8');
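The quote flag is the only difference from the call in the accepted answer. As a rough sketch with a made-up input, this is how `ENT_QUOTES` and `ENT_NOQUOTES` treat a quoted Hebrew word:

```php
<?php
// Hypothetical input: the word from the question wrapped in &quot; references.
$string = '&quot;&#x5E0;&#x5D5;&#x5E3;&quot;';

// ENT_QUOTES decodes quote references along with everything else.
$withQuotes = html_entity_decode($string, ENT_QUOTES, 'UTF-8');
echo $withQuotes, "\n"; // "נוף"

// ENT_NOQUOTES decodes the letters but leaves the quote references alone.
$noQuotes = html_entity_decode($string, ENT_NOQUOTES, 'UTF-8');
echo $noQuotes, "\n";   // &quot;נוף&quot;
```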

For more information, you can search Google for the entity in question. See these few examples:

edited Aug 25 '10 at 12:54

answered Aug 25 '10 at 12:42


Those are *not* entities, not even entity references. Those are just character references. – Gumbo Aug 25 '10 at 12:44
@Gumbo: Fair enough. They are not using the named entity... But the concept is nearly identical (except that no map is needed). I'll edit the answer to reflect that... – ircmaxell Aug 25 '10 at 12:54
