UTF-8 to Unicode Code Points

Question

Is there a function that will change UTF-8 to Unicode leaving non special characters as normal letters and numbers?

ie the German word "tchüß" would be rendered as something like "tch\20AC\21AC" (please note that I am making the Unicode codes up).

EDIT: I am experimenting with the following function, but although this one works well with ASCII 32-127, it seems to fail for double byte chars:

function strToHex ($string)
{
    $hex = '';
    for ($i = 0; $i < mb_strlen ($string, "utf-8"); $i++)
    {
        $id = ord (mb_substr ($string, $i, 1, "utf-8"));
        $hex .= ($id <= 128) ? mb_substr ($string, $i, 1, "utf-8") : "&#" . $id . ";";
    }

    return ($hex);
}

Any ideas?

EDIT 2: Found solution: The PHP ord() function does not work for double byte chars. Use instead: http://nl.php.net/manual/en/function.ord.php#78032
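
As a rough sketch of what a corrected version of the function above might look like (this is not the code from the linked manual comment): each character is converted to UTF-32BE, where the four bytes are simply the code point, and only characters outside ASCII 32-126 are replaced. The name strToNumericEntities is made up for this example, and it assumes the mbstring extension is loaded and the input is valid UTF-8.

```php
<?php
// Hypothetical helper: replace every character outside ASCII 32-126
// with a decimal numeric character reference (&#NNN;).
function strToNumericEntities($string)
{
    $out = '';
    // Split into characters; the /u flag makes PCRE treat the subject as UTF-8.
    $chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
    foreach ($chars as $char) {
        // UTF-32BE stores each code point as one big-endian 32-bit integer.
        $parts = unpack('N', mb_convert_encoding($char, 'UTF-32BE', 'UTF-8'));
        $codePoint = $parts[1];
        $out .= ($codePoint >= 32 && $codePoint <= 126)
            ? $char
            : '&#' . $codePoint . ';';
    }
    return $out;
}

echo strToNumericEntities("tchüß"), "\n"; // tch&#252;&#223;
```

The range check follows the asker's later definition of "non special" (ASCII 32 to 126); adjust it if control characters or DEL should be treated differently.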

Change the title to something more descriptive - UTF-8 **is** Unicode. You are probably looking for "UTF-8 to Unicode Code Points." — Artyom, Aug 18 '11 at 11:18
A useful resource: http://stackoverflow.com/questions/395832/how-to-get-code-point-number-for-a-given-character-in-a-utf-8-string — Karolis, Aug 18 '11 at 11:29
No, you can’t convert UTF‐8 to Unicode except in the pathological case through the identity operation. Define “no special characters” and “normal letters and numbers”! Are characters like "%" and "/" special? What about Control‐C? What makes a letter or number normal or abnormal? Are *ñ* U+00F1 and *ð* U+00F0 normal letters? What if *ñ* is really n followed by U+0303? For that matter, what makes a character a letter or number? Aren’t ¼ U+00BC and ² U+00B2 numbers? Unicode 6.0.0 has 100,520 GC=Letter and 1,100 GC=Number code points, of which 456 are GC=Letter_Number like Ⅷ. (*continued*...) — tchrist, Aug 18 '11 at 13:10
And that’s not all. What about the symbols in the `{Enclosed_Alphanumerics}` block, like Ⓚ U+24C0? That’s an Other_Symbol, but it has both an upper‐ and a lowercase. Is that normal enough to be a letter in your book? What about Other_Symbols like ™ U+2122, which have a compatibility decomposition that is simply "TM"? Is ㎎ U+338E ok but ㎍ U+338D not ok simply because you are prejudiced against Greek over Latin? How do you pretend to convert these to whatever your figment of normality may be? — tchrist, Aug 18 '11 at 13:14
"non special characters" would be in the range 32 to 126 of the ASCII table — Adrien Hingert, Aug 18 '11 at 13:48
Adrien: That definition would never have occurred to me. That means of Unicode’s 1,114,112 code points, merely 94 of them are **not** specials, leaving 1,114,018 of them to be classified as “specials”? That’s really counterintuitive. I claim that the ones that occur **five orders of magnitude** less frequently than the rest are the special ones. Otherwise you’ve turned the idea of specialness on its head. From my perspective, it’s actually code points 32–126 that are special, not the ones that are non‐special. Can’t see calling 99.99% of something “special”. As I said, would never have occurred to me. — tchrist, Aug 18 '11 at 14:14

score 28 · Answer 1 · answered Aug 18 '11 at 12:54

For a readable form I would go with JSON. JSON does not require non-ASCII characters to be escaped, but PHP's json_encode() escapes them by default:

echo json_encode("tchüß");

"tch\u00fc\u00df"
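
To make the "not required, but PHP does" point concrete, here is a small sketch comparing the default output with the JSON_UNESCAPED_UNICODE flag (available since PHP 5.4). The variable names are only illustrative.

```php
<?php
$word = "tchüß";

// Default behaviour: non-ASCII characters become \uXXXX escapes.
$escaped = json_encode($word);

// JSON_UNESCAPED_UNICODE keeps the characters as literal UTF-8.
$literal = json_encode($word, JSON_UNESCAPED_UNICODE);

echo $escaped, "\n"; // "tch\u00fc\u00df"
echo $literal, "\n"; // "tchüß"
```

Both results are valid JSON for the same string; the escaped form is simply pure ASCII. Note that the surrounding double quotes are part of the output.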

bobince

Interesting, never thought of this! – Adrien Hingert Aug 18 '11 at 13:49
Brilliant! Works like a charm.. :) – Anthony Feb 10 '13 at 02:48
JSON requires, by default, the escaping of non-ASCII characters. And you should do it every time. – William R Apr 05 '18 at 04:59
Great solution! – Basster Jun 20 '18 at 13:26
@WilliamR, why do you think so? JSON is by definition UTF-8, which is fully Unicode-capable. Escaping anything that is Unicode is not necessary. – Ulrich Eckhardt Oct 11 '18 at 12:06
Well, it is obvious to use UTF-8 for JSON. But escaping Unicode characters as ASCII ("é" becomes \u00e9) is a good way to protect your data against a bad "charset" set in the headers of an HTTP transmission, against badly programmed code or, even worse, against JSON inside a CDATA section in an ISO-Latin1 XML file. – William R Oct 24 '18 at 21:29

score 11 · Answer 2 · answered Feb 04 '16 at 22:56

With PHP 7, there is a new IntlChar::ord() to find the Unicode Code Point from a given UTF-8 character:

var_dump(sprintf('U+%04X', IntlChar::ord('ß')));

# Outputs: string(6) "U+00DF"

François

Note that you need extension=php_intl.dll enabled in PHP.ini for this class to be present. – eis May 28 '17 at 16:08

score 10 · Answer 3 · 2016-11-19T16:03:58.423

For people looking to find the Unicode code point of any character, this might be useful. You can then encode the string however you want, replacing certain characters with escape codes and leaving others in their binary form (e.g. printable ASCII characters), depending on the context in which you want to use it.

From: Mapping codepoints to Unicode encoding forms

The mapping for UTF-32 is, essentially, the identity mapping: the 32-bit code unit used to encode a codepoint has the same integer value as the codepoint itself.

/**
 * Convert a string into an array of decimal Unicode code points.
 *
 * @param $string   [string] The string to convert to codepoints
 * @param $encoding [string] The encoding of $string
 * 
 * @return [array] Array of decimal codepoints for every character of $string
 */
function toCodePoint( $string, $encoding )
{
    $utf32  = mb_convert_encoding( $string, 'UTF-32', $encoding );
    $length = mb_strlen( $utf32, 'UTF-32' );
    $result = [];

    for( $i = 0; $i < $length; ++$i )
    {
        $result[] = hexdec( bin2hex( mb_substr( $utf32, $i, 1, 'UTF-32' ) ) );
    }

    return $result;
}
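
A more compact variant of the same UTF-32 idea might look like the sketch below. It is not the answer's function: codePointsOf is a name invented here, and it requests UTF-32BE explicitly so that unpack()'s 'N' format (unsigned 32-bit, big-endian) can read all code units in one call. It assumes mbstring is loaded.

```php
<?php
// Hypothetical compact variant of the UTF-32 approach.
function codePointsOf($string, $encoding = 'UTF-8')
{
    // Explicit big-endian UTF-32 so that 'N' matches the byte order.
    $utf32 = mb_convert_encoding($string, 'UTF-32BE', $encoding);
    // unpack() returns a 1-based array; array_values() renumbers it from 0.
    return array_values(unpack('N*', $utf32));
}

$points = codePointsOf("tchüß");

// Format each code point in the usual U+XXXX notation.
echo implode(' ', array_map(function ($cp) {
    return sprintf('U+%04X', $cp);
}, $points)), "\n"; // U+0074 U+0063 U+0068 U+00FC U+00DF
```

This avoids the per-character mb_substr() and hexdec(bin2hex()) round trip, at the cost of relying on unpack()'s format codes.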

I needed to get the codepoint values for a UTF-8 string to check whether a given TTF font supports them, and this function worked perfectly for that. — Erik Kalkoken, Oct 16 '18 at 13:31

score 3 · Accepted Answer · edited Jul 10 '13 at 06:30

Converting one character set to another can be done with iconv:

http://php.net/manual/en/function.iconv.php

Note that UTF-8 is already a Unicode encoding.

Another way is simply using htmlentities with the right character set:

http://php.net/manual/en/function.htmlentities.php
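
For comparison, here is a small sketch of what htmlentities() produces next to mbstring's mb_encode_numericentity(), which emits numeric character references instead of named entities. The conversion map is an assumption chosen for this example: it covers every code point from U+0080 upward and leaves ASCII untouched.

```php
<?php
$word = "tchüß";

// htmlentities() only knows named entities such as &uuml; and &szlig;.
$named = htmlentities($word, ENT_QUOTES, 'UTF-8');

// Conversion map: [first code point, last code point, offset, mask].
// This one covers everything from U+0080 upward and leaves ASCII alone.
$map = array(0x80, 0x10FFFF, 0, 0x1FFFFF);
$numeric = mb_encode_numericentity($word, $map, 'UTF-8');

echo $named, "\n";   // tch&uuml;&szlig;
echo $numeric, "\n"; // tch&#252;&#223;
```

The numeric form works for any character, whereas named entities exist only for the subset defined by HTML.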

edited Jul 10 '13 at 06:30 by Gigala

answered Aug 18 '11 at 11:17 by Luwe

`htmlentities` only converts characters for which there are entities defined in the HTML language, though, which only covers a small subset of Unicode. Unfortunately it does not create `&#...;` character references for other characters. – bobince Aug 18 '11 at 12:56
I'm aware, but also `iconv` tends to give some problems. Not all characters seem to get perfectly converted for every character set. That's why I mentioned the `htmlentities` function. It was also suggested in the comments on the `iconv` function page: http://nl.php.net/manual/en/function.iconv.php#81494 – Luwe Aug 18 '11 at 13:04

score 2 · Answer 5 · answered Jul 11 '17 at 06:43

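Tested on PHP 5.6. (PHP 7.2 and later ship mb_ord() as part of mbstring, so the polyfill below is only needed on older versions and would clash with the built-in function there.)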

/**
 * @param string $utf8char
 * @return string
 */
function toUnicodeCodePoint($utf8char)
{
    return 'U+' . dechex(mb_ord($utf8char));
}

/**
 * @see https://github.com/symfony/polyfill-mbstring
 * @param string $s
 * @return int
 */
function mb_ord($s)
{
    $code = ($s = unpack('C*', substr($s, 0, 4))) ? $s[1] : 0;
    if (0xF0 <= $code) {
        return (($code - 0xF0) << 18) + (($s[2] - 0x80) << 12) + (($s[3] - 0x80) << 6) + $s[4] - 0x80;
    }
    if (0xE0 <= $code) {
        return (($code - 0xE0) << 12) + (($s[2] - 0x80) << 6) + $s[3] - 0x80;
    }
    if (0xC0 <= $code) {
        return (($code - 0xC0) << 6) + $s[2] - 0x80;
    }

    return $code;
}

echo toUnicodeCodePoint('😓');
// U+1f613
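
To see what the shifts in the polyfill are doing, here is the two-byte branch worked through by hand for "ß", whose UTF-8 bytes are 0xC3 0x9F. This is only an illustration of the arithmetic, not part of the answer's code.

```php
<?php
// The two bytes of "ß" in UTF-8 are 0xC3 0x9F.
$bytes = array_values(unpack('C*', "\xC3\x9F"));

// A lead byte of the form 110xxxxx carries the high 5 bits,
// a continuation byte of the form 10xxxxxx carries the low 6 bits.
$high = $bytes[0] - 0xC0; // 0xC3 - 0xC0 = 3
$low  = $bytes[1] - 0x80; // 0x9F - 0x80 = 31

$codePoint = ($high << 6) + $low; // 3 * 64 + 31 = 223

printf("U+%04X\n", $codePoint); // U+00DF
```

The three- and four-byte branches follow the same pattern, with each additional continuation byte contributing six more bits.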

score 2 · Answer 6 · answered Aug 18 '11 at 11:16

I guess you're going to print out your strings on a website?

I'm storing all my databases in UTF-8, using htmlentities($string) before output.

Maybe you have to try htmlentities(utf8_encode($string));

skywise

score 2 · Answer 7 · answered Aug 18 '11 at 11:29

I once created a function called _convert() which safely encodes everything to UTF-8.

powtac

you could add the answer here, and not as a link. – eis May 28 '17 at 16:15

score 0 · Answer 8 · answered Nov 16 '15 at 23:08

I had a problem where I needed to convert a string (UTF-8 by default) containing Cyrillic to entities only partly: just the Cyrillic characters. In the end I needed a JSON-like result, turning this:

<li class="my_class">City - Moscow (Москва)</li>

into this:

<li class=\"my_class\">City - Moscow (\u041c\u043e\u0441\u043a\u0432\u0430)<\/li>

So I came up with a combined solution (a mix of the question author's code and Nus's answer):

function strToHex($string){
    $enc="utf-8";
    $hex = '';
    for ($i = 0; $i < mb_strlen ($string, $enc); $i++){
        $id = ord (mb_substr ($string, $i, 1, $enc));
        $hex .= ($id <= 128) ? mb_substr ($string, $i, 1, $enc) : toCodePoint(mb_substr ($string, $i, 1, $enc), $enc);
    }
    return $hex;
}
function toCodePoint($string, $encoding){
    $utf32  = mb_convert_encoding( $string, 'UTF-32', $encoding );
    $length = mb_strlen( $utf32, 'UTF-32' );
    $result = Array();
    for( $i = 0; $i < $length; ++$i )$result[] = "\u".substr(bin2hex( mb_substr( $utf32, $i, 1, 'UTF-32' ) ), 4,8);
    return implode("", $result);
}
$output=strToHex(
    str_replace( // this is for json compatible
        array("\"", "\n", "\r", "\t", "/"),
        array('\"', '\n', "", " ", "\/"),
        $text
    )
);
echo $output;

Tested on PHP 5.2.17 :)
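
On PHP versions that have json_encode(), the same JSON-like result can probably be obtained more directly, as suggested in the JSON answer above: encode the string and strip the surrounding quotes. This is a sketch rather than a drop-in replacement; $text holds sample input, and unlike the function above json_encode() keeps newlines and tabs as \n and \t escapes instead of dropping them.

```php
<?php
$text = '<li class="my_class">City - Moscow (Москва)</li>';

// json_encode() escapes quotes, slashes and non-ASCII characters;
// substr(..., 1, -1) removes the enclosing double quotes it adds.
$output = substr(json_encode($text), 1, -1);

echo $output, "\n";
// <li class=\"my_class\">City - Moscow (\u041c\u043e\u0441\u043a\u0432\u0430)<\/li>
```

Characters outside the Basic Multilingual Plane come out as surrogate pairs, which is what JSON expects.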
