How to convert a Unicode string to HTML entities? (HEX
not decimal)
For example, convert Français
to Français
.
How to convert a Unicode string to HTML entities? (HEX
not decimal)
For example, convert Français
to Français
.
For the missing hex-encoding in the related question:
$output = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) {
list($utf8) = $match;
$binary = mb_convert_encoding($utf8, 'UTF-32BE', 'UTF-8');
$entity = vsprintf('&#x%X;', unpack('N', $binary));
return $entity;
}, $input);
This is similar to @Baba's answer using UTF-32BE and then unpack
and vsprintf
for the formatting needs.
If you prefer iconv
over mb_convert_encoding
, it's similar:
$output = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) {
list($utf8) = $match;
$binary = iconv('UTF-8', 'UTF-32BE', $utf8);
$entity = vsprintf('&#x%X;', unpack('N', $binary));
return $entity;
}, $input);
I find this string manipulation a bit more clear then in Get hexcode of html entities.
Your string looks like UCS-4
encoding you can try
$first = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($m) {
$char = current($m);
$utf = iconv('UTF-8', 'UCS-4', $char);
return sprintf("&#x%s;", ltrim(strtoupper(bin2hex($utf)), "0"));
}, $string);
Output
string 'Français' (length=13)
Firstly, when I faced this problem recently, I solved it by making sure my code-files, DB connection, and DB tables were all UTF-8 Then, simply echoing the text works. If you must escape the output from the DB use htmlspecialchars()
and not htmlentities()
so that the UTF-8 symbols are left alone and not attempted to be escaped.
Would like to document an alternative solution because it solved a similar problem for me.
I was using PHP's utf8_encode()
to escape 'special' characters.
I wanted to convert them into HTML entities for display, I wrote this code because I wanted to avoid iconv or such functions as far as possible since not all environments necessarily have them (do correct me if it is not so!)
function unicode2html($string) {
return preg_replace('/\\\\u([0-9a-z]{4})/', '&#x$1;', $string);
}
$foo = 'This is my test string \u03b50';
echo unicode2html($foo);
Hope this helps somebody in need :-)
See How to get the character from unicode code point in PHP? for some code that allows you to do the following :
echo "Get string from numeric DEC value\n";
var_dump(mb_chr(50319, 'UCS-4BE'));
var_dump(mb_chr(271));
echo "\nGet string from numeric HEX value\n";
var_dump(mb_chr(0xC48F, 'UCS-4BE'));
var_dump(mb_chr(0x010F));
echo "\nGet numeric value of character as DEC string\n";
var_dump(mb_ord('ď', 'UCS-4BE'));
var_dump(mb_ord('ď'));
echo "\nGet numeric value of character as HEX string\n";
var_dump(dechex(mb_ord('ď', 'UCS-4BE')));
var_dump(dechex(mb_ord('ď')));
echo "\nEncode / decode to DEC based HTML entities\n";
var_dump(mb_htmlentities('tchüß', false));
var_dump(mb_html_entity_decode('tchüß'));
echo "\nEncode / decode to HEX based HTML entities\n";
var_dump(mb_htmlentities('tchüß'));
var_dump(mb_html_entity_decode('tchüß'));
echo "\nUse JSON encoding / decoding\n";
var_dump(codepoint_encode("tchüß"));
var_dump(codepoint_decode('tch\u00fc\u00df'));
Get string from numeric DEC value
string(4) "ď"
string(2) "ď"
Get string from numeric HEX value
string(4) "ď"
string(2) "ď"
Get numeric value of character as DEC int
int(50319)
int(271)
Get numeric value of character as HEX string
string(4) "c48f"
string(3) "10f"
Encode / decode to DEC based HTML entities
string(15) "tchüß"
string(7) "tchüß"
Encode / decode to HEX based HTML entities
string(15) "tchüß"
string(7) "tchüß"
Use JSON encoding / decoding
string(15) "tch\u00fc\u00df"
string(7) "tchüß"
You can also use mb_encode_numericentity
which is supported by PHP 4.0.6+ (link to PHP doc).
function unicode2html($value) {
return mb_encode_numericentity($value, [
// start codepoint
// | end codepoint
// | | offset
// | | | mask
0x0000, 0x001F, 0x0000, 0xFFFF,
0x0021, 0x002C, 0x0000, 0xFFFF,
0x002E, 0x002F, 0x0000, 0xFFFF,
0x003C, 0x003C, 0x0000, 0xFFFF,
0x003E, 0x003E, 0x0000, 0xFFFF,
0x0060, 0x0060, 0x0000, 0xFFFF,
0x0080, 0xFFFF, 0x0000, 0xFFFF
], 'UTF-8', true);
}
In this way it is also possible to indicate which ranges of characters to convert into hexadecimal entities and which ones to preserve as characters.
Usage example:
$input = array(
'"Meno più, PIÙ o meno"',
'\'ÀÌÙÒLÈ PERCHÉ perché è sempre così non si sà\'',
'<script>alert("XSS");</script>',
'"`'
);
$output = array();
foreach ($input as $str)
$output[] = unicode2html($str)
Result:
$output = array(
'"Meno più, PIÙ o meno"',
''ÀÌÙÒLÈ PERCHÉ perché è sempre così non si sà'',
'<script>alert("XSS");</script>',
'"`'
);
This is solution like @hakre (Nov 8, 2012 at 0:35) but to html entity names:
$output = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) {
list($utf8) = $match;
$char = htmlentities($utf8, ENT_HTML5 | ENT_IGNORE);
if ($char[0]!=='&' || (strlen($char)<2)) {
$binary = mb_convert_encoding($utf8, 'UTF-32BE', 'UTF-8');
$char = vsprintf('&#x%X;', unpack('N', $binary));
} // (else $char is "&entity;", which is better)
return $char;
}, $input);
$input = "Ob\xC3\xB3z w\xC4\x99drowny Ko\xC5\x82a";
// => $output: "Obóz wędrowny Koła"
//while @hakre/@Baba both codes:
// => $output: "Obóz wędrowny Koła"
But always is problem with encountered not proper UTF-8, i.e.:
$input = "Ob\xC3\xB3z w\xC4\x99drowny Ko\xC5\x82a - ok\xB3adka";
// means "Obóz wędrowny Koła - - okładka" in html ("\xB3" is ISO-8859-2/windows-1250 "ł")
but here
// => $output: (empty)
also with @hakre code... :(
It was hard to find out the cause, the only solution I know (maybe does anyone know a simpler one? please):
function utf_entities($input) {
$output = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) {
list($utf8) = $match;
$char = htmlentities($utf8, ENT_HTML5 | ENT_IGNORE);
if ($char[0]!=='&' || (strlen($char)<2)) {
$binary = mb_convert_encoding($utf8, 'UTF-32BE', 'UTF-8');
$char = vsprintf('&#x%X;', unpack('N', $binary));
} // (else $char is "&entity;", which is better)
return $char;
}, $input);
if (empty($output) && (!empty($input))) { // Trouble... Maybe not UTF-8 code inside UTF-8 string...
/* Processing string against not UTF-8 chars... */
$output = ''; // New - repaired
for ($i=0; $i<strlen($input); $i++) {
if (($char = $input[$i])<"\x80") {
$output .= $char;
} else { // maybe UTF-8 (0b ..110xx..) or not UTF-8 (i.e. 0b11111111 etc.)
$j = 0; // how many chars more in UTF-8
$char = ord($char);
do { // checking first UTF-8 code char bits
$char = ($char << 1) % 0x100;
$j++;
} while (($j<4 /* 6 before RFC 3629 */)&& (($char & 0b11000000) === 0b11000000));
$k = $i+1;
if ($j<4 /* 6 before RFC 3629 */ && (($char & 0b11000000) === 0b10000000)) { // maybe UTF-8...
for ($k=$i+$j; $k>$i && ((ord($input[$k]) & 0b11000000) === 0b10000000); $k--) ; // ...checking next bytes for valid UTF-8 codes
}
if ($k>$i || ($j>=4 /* 6 before RFC 3629 */) || (($char & 0b11000000) !== 0b10000000)) { // Not UTF-8
$output .= '&#x'.dechex(ord($input[$i])).';'; // "&#xXX;"
} else { // UTF=8 !
$output .= substr($input, $i, 1+$j);
$i += $j;
}
}
}
return utf_entities($output); // recursively after repairing
}
return $output;
}
I.e.:
echo utf_entities("o\xC5\x82a - k\xB3a"); // oła - k³a - UTF-8 + fixed
echo utf_entities("o".chr(0b11111101).chr(0b10111000).chr(0b10111000).chr(0b10111000).chr(0b10111000).chr(0b10111000)."a");
// oñ¸¸¸¸¸a - invalid UTF-8 (6-bytes UTF-8 valid before RFC 3629), fixed
echo utf_entities("o".chr(0b11110001).chr(0b10111000).chr(0b10111000).chr(0b10111000)."a - k\xB3a");
// o񸸸a - k³a - UTF-8 + fixed ("\xB3")
echo utf_entities("o".chr(0b11110001).chr(0b10111000).chr(0b10111000).chr(0b10111000)."a");
// o񸸸a - valid UTF-8!
echo utf_entities("o".chr(0b11110001).'a'.chr(0b10111000).chr(0b10111000)."a");
// oña¸¸a - invalid UTF-8, fixed
An alternative that builds on the ideas in some of the other answers here but doesn't rely on mbstring or iconv. (Entities are in decimal, but that can be changed to hex easily enough by adding a call to bin2hex
before returning, and of course adding an 'x' to the string. If that's a requirement for you; it wasn't for me when I found this question.)
/**
* Convert all non-ascii unicode (utf-8) characters in a string to their HTML entity equivalent.
*
* Only UTF-8 is supported, as we don't have access to mbstring.
*/
function unicode2html($string) {
return preg_replace_callback('/[^\x00-\x7F]/u', function($matches){
// Adapted from https://www.php.net/manual/en/function.ord.php#109812
$offset = 0;
$code = ord(substr($matches[0], $offset,1));
if ($code >= 128) { //otherwise 0xxxxxxx
if ($code < 224) $bytesnumber = 2; //110xxxxx
else if ($code < 240) $bytesnumber = 3; //1110xxxx
else if ($code < 248) $bytesnumber = 4; //11110xxx
$codetemp = $code - 192 - ($bytesnumber > 2 ? 32 : 0) - ($bytesnumber > 3 ? 16 : 0);
for ($i = 2; $i <= $bytesnumber; $i++) {
$offset ++;
$code2 = ord(substr($matches[0], $offset, 1)) - 128; //10xxxxxx
$codetemp = $codetemp*64 + $code2;
}
$code = $codetemp;
}
return "&#$code;";
}, $string);
}