6

How to convert a Unicode string to HTML entities? (HEX not decimal)

For example, convert Français to Fran&#xE7;ais.

mrdaliri
  • 1
    What do you need this for? It *should* never be necessary.... – Pekka Nov 07 '12 at 23:47
  • 4
    It depends on which unicode encoding in specific. [`mb_convert_encoding($utf_8, 'HTML-ENTITIES', 'UTF-8');`](http://stackoverflow.com/a/11310258/367456) for example works for UTF-8 unicode strings in PHP. If you *need* hex encodings the linked answer shows you how to capture all those (from utf-8 strings) you only need to run your hex encoding. – hakre Nov 07 '12 at 23:56
  • @hakre: the string is `UTF-8`. `mb_convert_encoding($utf_8, 'HTML-ENTITIES', 'UTF-8');` converts to decimal entities, but I want `hex` codes. – mrdaliri Nov 08 '12 at 00:04
  • 1
    Your question is not very precise. I think if I take it right, the output is `Français` and not `Français`. – hakre Nov 08 '12 at 00:15
  • possible duplicate of [Get hexcode of html entities](http://stackoverflow.com/questions/7482977/get-hexcode-of-html-entities) – hakre Nov 08 '12 at 00:17
  • 2
    @Pekka웃 - I've just found a vendor API in 2015 that requires plain US-ASCII XML requests to process a Unicode-related feature. *sigh* – Álvaro González Apr 24 '15 at 07:48
  • @ÁlvaroG.Vicario argh!!! – Pekka Apr 24 '15 at 07:56

7 Answers

11

For the missing hex-encoding in the related question:

$output = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) {
    list($utf8) = $match;
    $binary = mb_convert_encoding($utf8, 'UTF-32BE', 'UTF-8');
    $entity = vsprintf('&#x%X;', unpack('N', $binary));
    return $entity;
}, $input);

This is similar to @Baba's answer using UTF-32BE and then unpack and vsprintf for the formatting needs.

If you prefer iconv over mb_convert_encoding, it's similar:

$output = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) {
    list($utf8) = $match;
    $binary = iconv('UTF-8', 'UTF-32BE', $utf8);
    $entity = vsprintf('&#x%X;', unpack('N', $binary));
    return $entity;
}, $input);

I find this string manipulation a bit clearer than the one in Get hexcode of html entities.
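On PHP 7.2 or later, the built-in mb_ord() can stand in for the UTF-32BE round trip. This is only a sketch under that assumption (mbstring enabled, valid UTF-8 input); the function name to_hex_entities is made up for illustration and is not part of the answer above.

```php
// Sketch: hex numeric entities for every non-ASCII character.
// Assumes PHP 7.2+ with mbstring; to_hex_entities() is a hypothetical helper name.
function to_hex_entities($input)
{
    // With the /u modifier each match is one complete UTF-8 character.
    // preg_replace_callback() returns null if $input is not valid UTF-8.
    return preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) {
        // mb_ord() returns the Unicode code point as an integer.
        return sprintf('&#x%X;', mb_ord($match[0], 'UTF-8'));
    }, $input);
}

echo to_hex_entities("Fran\u{E7}ais"), "\n"; // Fran&#xE7;ais
```

ASCII characters pass through unchanged, so the result is plain US-ASCII as long as the input was valid UTF-8.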

hakre
  • Splendid! I use this to code back CKEditors output converting my html entities to unicode symbols. – Daniel Mar 26 '16 at 12:45
  • This helped me display emojis on a ISO-8859-1 website. First I convert to hex using this approach, then I can save it in the db and display it in both the website, and the webview in an app. Very nice. – Jette Aug 26 '17 at 21:03
8

Your string looks like UCS-4 encoding. You can try:

$first = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($m) {
    $char = current($m);
    $utf = iconv('UTF-8', 'UCS-4', $char);
    return sprintf("&#x%s;", ltrim(strtoupper(bin2hex($utf)), "0"));
}, $string);

Output

string 'Fran&#xE7;ais' (length=13)
Baba
5

Firstly, when I faced this problem recently, I solved it by making sure my code files, DB connection, and DB tables were all UTF-8. Then simply echoing the text works. If you must escape the output from the DB, use htmlspecialchars() and not htmlentities(), so that the UTF-8 characters are left alone rather than escaped.

I would like to document an alternative solution because it solved a similar problem for me. I was using PHP's utf8_encode() to escape 'special' characters.

I wanted to convert them into HTML entities for display. I wrote this code, which turns \uXXXX escape sequences into hex entities, because I wanted to avoid iconv or similar functions as far as possible, since not all environments necessarily have them (do correct me if that is not so!)

function unicode2html($string) {
    return preg_replace('/\\\\u([0-9a-fA-F]{4})/', '&#x$1;', $string);
}

$foo = 'This is my test string \u03b50';
echo unicode2html($foo);

Hope this helps somebody in need :-)

msi
Angad
0

See How to get the character from unicode code point in PHP? for some code that allows you to do the following :

Example use :

echo "Get string from numeric DEC value\n";
var_dump(mb_chr(50319, 'UCS-4BE'));
var_dump(mb_chr(271));

echo "\nGet string from numeric HEX value\n";
var_dump(mb_chr(0xC48F, 'UCS-4BE'));
var_dump(mb_chr(0x010F));

echo "\nGet numeric value of character as DEC int\n";
var_dump(mb_ord('ď', 'UCS-4BE'));
var_dump(mb_ord('ď'));

echo "\nGet numeric value of character as HEX string\n";
var_dump(dechex(mb_ord('ď', 'UCS-4BE')));
var_dump(dechex(mb_ord('ď')));

echo "\nEncode / decode to DEC based HTML entities\n";
var_dump(mb_htmlentities('tchüß', false));
var_dump(mb_html_entity_decode('tch&#252;&#223;'));

echo "\nEncode / decode to HEX based HTML entities\n";
var_dump(mb_htmlentities('tchüß'));
var_dump(mb_html_entity_decode('tch&#xFC;&#xDF;'));

echo "\nUse JSON encoding / decoding\n";
var_dump(codepoint_encode("tchüß"));
var_dump(codepoint_decode('tch\u00fc\u00df'));

Output :

Get string from numeric DEC value
string(4) "ď"
string(2) "ď"

Get string from numeric HEX value
string(4) "ď"
string(2) "ď"

Get numeric value of character as DEC int
int(50319)
int(271)

Get numeric value of character as HEX string
string(4) "c48f"
string(3) "10f"

Encode / decode to DEC based HTML entities
string(15) "tch&#252;&#223;"
string(7) "tchüß"

Encode / decode to HEX based HTML entities
string(15) "tch&#xFC;&#xDF;"
string(7) "tchüß"

Use JSON encoding / decoding
string(15) "tch\u00fc\u00df"
string(7) "tchüß"
John Slegers
0

You can also use mb_encode_numericentity, which has been available since PHP 4.0.6. The short array syntax and the hexadecimal flag (the fourth argument) used below need PHP 5.4 or later.

function unicode2html($value) {
    return mb_encode_numericentity($value, [
    //  start codepoint
    //  |       end codepoint
    //  |       |       offset
    //  |       |       |       mask
        0x0000, 0x001F, 0x0000, 0xFFFF,
        0x0021, 0x002C, 0x0000, 0xFFFF,
        0x002E, 0x002F, 0x0000, 0xFFFF,
        0x003C, 0x003C, 0x0000, 0xFFFF,
        0x003E, 0x003E, 0x0000, 0xFFFF,
        0x0060, 0x0060, 0x0000, 0xFFFF,
        0x0080, 0xFFFF, 0x0000, 0xFFFF
    ], 'UTF-8', true);
}

In this way it is also possible to indicate which ranges of characters to convert into hexadecimal entities and which ones to preserve as characters.

Usage example:

$input = array(
    '"Meno più, PIÙ o meno"',
    '\'ÀÌÙÒLÈ PERCHÉ perché è sempre così non si sà\'',
    '<script>alert("XSS");</script>',
    '"`'
);

$output = array();
foreach ($input as $str)
    $output[] = unicode2html($str);

Result:

$output = array(
    '&#x22;Meno pi&#xF9;&#x2C; PI&#xD9; o meno&#x22;',
    '&#x27;&#xC0;&#xCC;&#xD9;&#xD2;L&#xC8; PERCH&#xC9; perch&#xE9; &#xE8; sempre cos&#xEC; non si s&#xE0;&#x27;',
    '&#x3C;script&#x3E;alert&#x28;&#x22;XSS&#x22;&#x29;;&#x3C;&#x2F;script&#x3E;',
    '&#x22;&#x60;'
);
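The last row of the map above stops at 0xFFFF, so code points beyond the Basic Multilingual Plane (emoji, for instance) are left as they are. If only non-ASCII characters need converting, a single row that spans the whole Unicode range may be enough. The following is a sketch under that assumption, with a hypothetical function name; it needs mbstring.

```php
// Sketch: convert every non-ASCII code point, including those above U+FFFF.
// all_non_ascii_to_hex() is a hypothetical name chosen for this example.
function all_non_ascii_to_hex($value) {
    //          start, end,      offset, mask
    $convmap = [0x80,  0x10FFFF, 0,      0x1FFFFF];
    return mb_encode_numericentity($value, $convmap, 'UTF-8', true);
}

echo all_non_ascii_to_hex("pi\u{F9} \u{1F600}"), "\n"; // pi&#xF9; &#x1F600;
```

The mask has to be wide enough for the largest code point (0x10FFFF needs 21 bits), otherwise the upper bits would be cut off.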
Marco Sacchi
0

This is a solution like @hakre's (Nov 8, 2012 at 0:35), but it prefers named HTML entities:

$output = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) {
    list($utf8) = $match;
    $char = htmlentities($utf8, ENT_HTML5 | ENT_IGNORE);
    if ($char[0]!=='&' || (strlen($char)<2)) {
        $binary = mb_convert_encoding($utf8, 'UTF-32BE', 'UTF-8');
        $char = vsprintf('&#x%X;', unpack('N', $binary));
    } // (else $char is "&entity;", which is better)
    return $char;
}, $input);

$input = "Ob\xC3\xB3z w\xC4\x99drowny Ko\xC5\x82a";
// => $output: "Ob&oacute;z w&eogon;drowny Ko&lstrok;a"
//while @hakre/@Baba both codes:
// => $output: "Ob&#xF3;z w&#x119;drowny Ko&#x142;a"

However, input that is not valid UTF-8 is always a problem, e.g.:

$input = "Ob\xC3\xB3z w\xC4\x99drowny Ko\xC5\x82a - ok\xB3adka";
// means "Ob&oacute;z w&eogon;drowny Ko&lstrok;a - ok&lstrok;adka" in HTML ("\xB3" is "ł" in ISO-8859-2/windows-1250)

but here

// => $output: (empty)

The same happens with @hakre's code, because preg_replace_callback() returns null when a /u subject is not valid UTF-8... :(

It was hard to find the cause. The only solution I know is this (does anyone know a simpler one?):

function utf_entities($input) {
    $output = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) {
        list($utf8) = $match;
        $char = htmlentities($utf8, ENT_HTML5 | ENT_IGNORE);
        if ($char[0]!=='&' || (strlen($char)<2)) {
            $binary = mb_convert_encoding($utf8, 'UTF-32BE', 'UTF-8');
            $char = vsprintf('&#x%X;', unpack('N', $binary));
        } // (else $char is "&entity;", which is better)
        return $char;
    }, $input);
    if (empty($output) && (!empty($input))) { // Trouble... probably non-UTF-8 bytes inside the UTF-8 string

        /* Repair the string: escape every byte that is not part of valid UTF-8 */
        $output = ''; // New - repaired
        for ($i=0; $i<strlen($input); $i++) {
            if (($char = $input[$i])<"\x80") {
                $output .= $char;
            } else { // maybe UTF-8 (0b ..110xx..) or not UTF-8 (e.g. 0b11111111)
                $j = 0; // how many more bytes the UTF-8 sequence should have
                $char = ord($char);
                do { // check the leading bits of the first byte
                    $char = ($char << 1) % 0x100;
                    $j++;
                } while (($j<4 /* 6 before RFC 3629 */) && (($char & 0b11000000) === 0b11000000));
                $k = $i+1;
                if ($j<4 /* 6 before RFC 3629 */ && (($char & 0b11000000) === 0b10000000)) { // maybe UTF-8...
                    // ...check that the following bytes are continuation bytes (isset() guards a truncated sequence)
                    for ($k=$i+$j; $k>$i && isset($input[$k]) && ((ord($input[$k]) & 0b11000000) === 0b10000000); $k--) ;
                }
                if ($k>$i || ($j>=4 /* 6 before RFC 3629 */) || (($char & 0b11000000) !== 0b10000000)) { // Not UTF-8
                    $output .= '&#x'.dechex(ord($input[$i])).';'; // "&#xXX;"
                } else { // UTF-8!
                    $output .= substr($input, $i, 1+$j);
                    $i += $j;
                }
            }
        }
        return utf_entities($output); // run again on the repaired string
    }
    return $output;
}

For example:

echo utf_entities("o\xC5\x82a - k\xB3a"); // o&lstrok;a - k&#xb3;a - UTF-8 + fixed
echo utf_entities("o".chr(0b11111101).chr(0b10111000).chr(0b10111000).chr(0b10111000).chr(0b10111000).chr(0b10111000)."a");
// o&#xfd;&#xb8;&#xb8;&#xb8;&#xb8;&#xb8;a - invalid UTF-8 (6-byte UTF-8 was valid before RFC 3629), fixed
echo utf_entities("o".chr(0b11110001).chr(0b10111000).chr(0b10111000).chr(0b10111000)."a - k\xB3a");
// o&#x78E38;a - k&#xb3;a - UTF-8 + fixed ("\xB3")
echo utf_entities("o".chr(0b11110001).chr(0b10111000).chr(0b10111000).chr(0b10111000)."a");
// o&#x78E38;a - valid UTF-8!
echo utf_entities("o".chr(0b11110001).'a'.chr(0b10111000).chr(0b10111000)."a");
// o&#xf1;a&#xb8;&#xb8;a - invalid UTF-8, fixed
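For anyone who only needs to detect the problem rather than repair it, the empty result can be recognised directly. This is a small sketch, not part of the function above; it assumes mbstring is available for mb_check_encoding(), and the variable names are illustrative only.

```php
// Sketch: two ways to notice invalid UTF-8 around a /u pattern.
$valid   = "o\xC5\x82a"; // valid UTF-8
$invalid = "k\xB3a";     // lone 0xB3 byte, not valid UTF-8

// 1. Check up front.
var_dump(mb_check_encoding($valid, 'UTF-8'));   // bool(true)
var_dump(mb_check_encoding($invalid, 'UTF-8')); // bool(false)

// 2. Check afterwards: the PCRE function returns null and records the reason.
$result = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($m) {
    return '?';
}, $invalid);
var_dump($result === null);                          // bool(true)
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR); // bool(true)
```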
msegit
0

An alternative that builds on the ideas in some of the other answers here but doesn't rely on mbstring or iconv. (Entities are in decimal, but that can be changed to hex easily enough by passing $code through dechex() before returning and adding an 'x' after the '&#', if that's a requirement for you; it wasn't for me when I found this question.)

/** 
 * Convert all non-ascii unicode (utf-8) characters in a string to their HTML entity equivalent.
 * 
 * Only UTF-8 is supported, as we don't have access to mbstring.
 */
function unicode2html($string) {
    return preg_replace_callback('/[^\x00-\x7F]/u', function($matches){
        // Adapted from https://www.php.net/manual/en/function.ord.php#109812
        $offset = 0;
        $code = ord(substr($matches[0], $offset,1));
        if ($code >= 128) {        //otherwise 0xxxxxxx
            if ($code < 224) $bytesnumber = 2;                //110xxxxx
            else if ($code < 240) $bytesnumber = 3;        //1110xxxx
            else if ($code < 248) $bytesnumber = 4;    //11110xxx
            $codetemp = $code - 192 - ($bytesnumber > 2 ? 32 : 0) - ($bytesnumber > 3 ? 16 : 0);
            for ($i = 2; $i <= $bytesnumber; $i++) {
                $offset ++;
                $code2 = ord(substr($matches[0], $offset, 1)) - 128;        //10xxxxxx
                $codetemp = $codetemp*64 + $code2;
            }
            $code = $codetemp;
        }
        return "&#$code;";
    }, $string);
}
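A hex variant along the same lines might look like the sketch below. It uses dechex() for the conversion and bit masks instead of subtraction; the function name unicode2hexhtml is made up for this example, and valid UTF-8 input is assumed. The "\u{...}" escapes in the usage line need PHP 7.

```php
// Sketch of a hex variant without mbstring or iconv; assumes valid UTF-8 input.
// unicode2hexhtml() is a hypothetical name, not taken from the code above.
function unicode2hexhtml($string) {
    return preg_replace_callback('/[^\x00-\x7F]/u', function ($matches) {
        // Raw bytes of one UTF-8 character (2 to 4 bytes), re-indexed from 0.
        $bytes = array_values(unpack('C*', $matches[0]));
        $count = count($bytes);
        // Lead byte payload: 5 bits (2-byte), 4 bits (3-byte) or 3 bits (4-byte sequence).
        $code = $bytes[0] & (0xFF >> ($count + 1));
        for ($i = 1; $i < $count; $i++) {
            // Each continuation byte (10xxxxxx) contributes 6 payload bits.
            $code = ($code << 6) | ($bytes[$i] & 0x3F);
        }
        return '&#x' . strtoupper(dechex($code)) . ';';
    }, $string);
}

echo unicode2hexhtml("Fran\u{E7}ais"), "\n"; // Fran&#xE7;ais
```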
DMJ