6

Is there a function that will change UTF-8 to Unicode leaving non special characters as normal letters and numbers?

i.e. the German word "tchüß" would be rendered as something like "tch\20AC\21AC" (please note that I am making the Unicode codes up).

EDIT: I am experimenting with the following function, but although this one works well with ASCII 32-127, it seems to fail for double byte chars:

function strToHex ($string)
{
    $hex = '';
    for ($i = 0; $i < mb_strlen ($string, "utf-8"); $i++)
    {
        $id = ord (mb_substr ($string, $i, 1, "utf-8"));
        $hex .= ($id <= 128) ? mb_substr ($string, $i, 1, "utf-8") : "&#" . $id . ";";
    }

    return ($hex);
}

Any ideas?

EDIT 2: Found solution: The PHP ord() function does not work for double byte chars. Use instead: http://nl.php.net/manual/en/function.ord.php#78032
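
If the goal is decimal `&#...;` references for everything outside ASCII, as in the strToHex() attempt above, mbstring's mb_encode_numericentity() may be a shorter route than a character-by-character loop. This is a minimal sketch, assuming the mbstring extension is loaded and PHP 7 or later (for the type declarations); the helper name escapeNonAscii() is made up for illustration.

```php
<?php
// Hypothetical helper (not from the original post): replace every non-ASCII
// character with a decimal numeric character reference, leave ASCII as-is.
function escapeNonAscii(string $string): string
{
    // Conversion map: [first code point, last code point, offset, mask].
    // 0x80..0x10FFFF covers everything above ASCII; the mask keeps all 21 bits.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];

    return mb_encode_numericentity($string, $map, 'UTF-8');
}

echo escapeNonAscii('tchüß'), "\n"; // tch&#252;&#223;
```

Control characters below 32 and DEL (127) pass through unchanged here; widening the map would be needed if those should count as "special" too.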

Adrien Hingert
  • 4
    Change the title to something more descriptive - UTF-8 **is** Unicode. You are probably looking for "UTF-8 to Unicode Code Points." – Artyom Aug 18 '11 at 11:18
  • A useful resource: http://stackoverflow.com/questions/395832/how-to-get-code-point-number-for-a-given-character-in-a-utf-8-string – Karolis Aug 18 '11 at 11:29
  • How do you define "non special characters"? – borrible Aug 18 '11 at 11:54
  • 1
    No, you can’t convert UTF‐8 to Unicode except in the pathological case through the identity operation. Define “no special characters” and “normal letters and numbers”! Are characters like "%" and "/" special? What about Control‐C? What makes a letter or number normal or abnormal? Are *ñ* U+00F1 and *ð* U+00F0 normal letters? What if *ñ* is really n followed by U+0303? For that matter, what makes a character a letter or number? Aren’t ¼ U+00BC and ² U+00B2 numbers? Unicode 6.0.0 has 100,520 GC=Letter and 1,100 GC=Number code points, of which 456 are GC=Letter_Number like Ⅷ. (*continued*...) – tchrist Aug 18 '11 at 13:10
  • And that’s not all. What about the symbols in the `{Enclosed_Alphanumerics}` block, like Ⓚ U+24C0? That’s an Other_Symbol, but it has both an upper‐ and a lowercase. Is that normal enough to be a letter in your book? What about Other_Symbols like ™ U+2122, which have a compatibility decomposition that is simply "TM"? Is ㎎ U+338E ok but ㎍ U+338D not ok simply because you are prejudiced against Greek over Latin? How do you pretend to convert these to whatever your figment of normality may be? – tchrist Aug 18 '11 at 13:14
  • "non special characters" would be in the range 32 to 126 of the ASCII table – Adrien Hingert Aug 18 '11 at 13:48
  • 4
    Adrien: That definition would never have occurred to me. That means of Unicode’s 1,114,112 code points, merely 94 of them are **not** specials, leaving 1,114,018 of them to be classified as “specials”? That’s really counterintuitive. I claim that the ones that occur **five orders of magnitude** less frequently than the rest are the special ones. Otherwise you’ve turned the idea of specialness on its head. From my perspective, it’s actually code points 32–126 that are special, not the ones that are non‐special. Can’t see calling 99.99% of something “special”. As I said, it would never have occurred to me. – tchrist Aug 18 '11 at 14:14

8 Answers

28

For a readable form, I would go with JSON. JSON does not require non-ASCII characters to be escaped, but PHP's json_encode() escapes them by default:

echo json_encode("tchüß");

"tch\u00fc\u00df"
bobince
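
For the json_encode() approach above, the escaping can be switched off with the JSON_UNESCAPED_UNICODE flag (available since PHP 5.4), and the surrounding quotes can be trimmed if only the escaped text is wanted. A small sketch, assuming the input is valid UTF-8; the variable names are illustrative only.

```php
<?php
$word = 'tchüß';

// Default behaviour: non-ASCII characters become \uXXXX escapes.
$escaped = json_encode($word);

// JSON_UNESCAPED_UNICODE keeps them as literal UTF-8 instead.
$literal = json_encode($word, JSON_UNESCAPED_UNICODE);

// Drop the surrounding JSON quotes if only the escaped text is needed.
$bare = substr($escaped, 1, -1);

echo $escaped, "\n"; // "tch\u00fc\u00df"
echo $literal, "\n"; // "tchüß"
echo $bare, "\n";    // tch\u00fc\u00df
```

Note that trimming the quotes leaves JSON's other escapes (such as `\"` and `\/`) in place, so the result is still JSON-flavoured text rather than a general-purpose encoding.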
11

With PHP 7, there is a new IntlChar::ord() to find the Unicode Code Point from a given UTF-8 character:

var_dump(sprintf('U+%04X', IntlChar::ord('ß')));

# Outputs: string(6) "U+00DF"
François
  • 1
    Note that you need extension=php_intl.dll enabled in PHP.ini for this class to be present. – eis May 28 '17 at 16:08
10

For people looking to find the Unicode code point for any character, this might be useful. You can then encode the string however you want, replacing certain characters with escape codes and leaving others in their binary form (e.g. printable ASCII characters), depending on the context in which you want to use it.

From: Mapping codepoints to Unicode encoding forms

The mapping for UTF-32 is, essentially, the identity mapping: the 32-bit code unit used to encode a codepoint has the same integer value as the codepoint itself.

/**
 * Convert a string into an array of decimal Unicode code points.
 *
 * @param $string   [string] The string to convert to codepoints
 * @param $encoding [string] The encoding of $string
 * 
 * @return [array] Array of decimal codepoints for every character of $string
 */
function toCodePoint( $string, $encoding )
{
    $utf32  = mb_convert_encoding( $string, 'UTF-32', $encoding );
    $length = mb_strlen( $utf32, 'UTF-32' );
    $result = [];


    for( $i = 0; $i < $length; ++$i )
    {
        $result[] = hexdec( bin2hex( mb_substr( $utf32, $i, 1, 'UTF-32' ) ) );
    }


    return $result;
}
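
Because UTF-32 is the identity mapping, the same idea can also be sketched without a loop by unpacking the UTF-32 bytes as 32-bit integers. This is an alternative sketch, not the function above: it assumes UTF-8 input, the mbstring extension, and PHP 7 or later, and the helper names codePoints() and formatCodePoints() are made up for illustration.

```php
<?php
// Hypothetical helper: decimal code points of a UTF-8 string.
function codePoints(string $string): array
{
    // Big-endian UTF-32 without a BOM: 4 bytes per code point.
    $utf32 = mb_convert_encoding($string, 'UTF-32BE', 'UTF-8');

    // 'N*' reads consecutive unsigned 32-bit big-endian integers.
    // unpack() returns a 1-indexed array, so reindex from 0.
    return array_values(unpack('N*', $utf32));
}

// Hypothetical helper: render the code points in U+XXXX notation.
function formatCodePoints(string $string): string
{
    $labels = array_map(function ($cp) {
        return sprintf('U+%04X', $cp);
    }, codePoints($string));

    return implode(' ', $labels);
}

echo formatCodePoints('tchüß'), "\n"; // U+0074 U+0063 U+0068 U+00FC U+00DF
```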
  • I needed to get the code point values for a UTF-8 string to check whether a given TTF font supports them, and this function worked perfectly. – Erik Kalkoken Oct 16 '18 at 13:31
3

Converting one character set to another can be done with iconv:

http://php.net/manual/en/function.iconv.php

Note that UTF-8 is already a Unicode encoding.

Another way is simply using htmlentities with the right character set:

http://php.net/manual/en/function.htmlentities.php

Gigala
Luwe
  • `htmlentities` only converts characters for which there are entities defined in the HTML language, though, which only covers a small subset of Unicode. Unfortunately it does not create `&#...;` character references for other characters. – bobince Aug 18 '11 at 12:56
  • I'm aware, but also `iconv` tends to give some problems. Not all characters seem to get perfectly converted for every character set. That's why I mentioned the `htmlentities` function. It was also suggested in the comments on the `iconv` function page: http://nl.php.net/manual/en/function.iconv.php#81494 – Luwe Aug 18 '11 at 13:04
2

Tested on PHP 5.6

/**
 * @param string $utf8char
 * @return string
 */
function toUnicodeCodePoint($utf8char)
{
    return 'U+' . dechex(mb_ord($utf8char));
}

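/**
 * Fallback for PHP versions before 7.2, where mbstring has no built-in mb_ord().
 * The function_exists() guard avoids a "cannot redeclare" error on newer versions.
 *
 * @see https://github.com/symfony/polyfill-mbstring
 * @param string $s
 * @return int
 */
if (!function_exists('mb_ord')) {
    function mb_ord($s)
    {
        // Read up to four bytes; $s becomes a 1-indexed array of byte values.
        $code = ($s = unpack('C*', substr($s, 0, 4))) ? $s[1] : 0;
        if (0xF0 <= $code) { // four-byte sequence
            return (($code - 0xF0) << 18) + (($s[2] - 0x80) << 12) + (($s[3] - 0x80) << 6) + $s[4] - 0x80;
        }
        if (0xE0 <= $code) { // three-byte sequence
            return (($code - 0xE0) << 12) + (($s[2] - 0x80) << 6) + $s[3] - 0x80;
        }
        if (0xC0 <= $code) { // two-byte sequence
            return (($code - 0xC0) << 6) + $s[2] - 0x80;
        }

        return $code; // single-byte (ASCII)
    }
}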

echo toUnicodeCodePoint('😓');
// U+1f613
Garlaro
2

I guess you're going to print out your strings on a website?

I store everything in my databases as UTF-8 and call htmlentities($string) before output.

Maybe you need to try htmlentities(utf8_encode($string));

skywise
2

I once created a function called _convert() which safely encodes everything to UTF-8.

powtac
0

I had a problem where I needed to convert a string (UTF-8 by default) containing Cyrillic into escape sequences only partially: just the Cyrillic characters. In the end I needed a JSON-like result, going from this:

<li class="my_class">City - Mocsow (Москва)</li>

to this:

<li class=\"my_class\">City - Mocsow (\u041c\u043e\u0441\u043a\u0432\u0430)<\/li>

So I ended up with a combined solution (a mix of the question author's code and Nus's answer):

function strToHex($string){
    $enc = "utf-8";
    $hex = '';
    for ($i = 0; $i < mb_strlen($string, $enc); $i++){
        $char = mb_substr($string, $i, 1, $enc);
        // ord() only sees the first byte, which is enough to tell ASCII from multibyte
        $hex .= (ord($char) < 128) ? $char : toCodePoint($char, $enc);
    }
    return $hex;
}
function toCodePoint($string, $encoding){
    $utf32  = mb_convert_encoding($string, 'UTF-32', $encoding);
    $length = mb_strlen($utf32, 'UTF-32');
    $result = array();
    for ($i = 0; $i < $length; ++$i) {
        // keeps only the low four hex digits, so this is correct up to U+FFFF
        $result[] = '\u' . substr(bin2hex(mb_substr($utf32, $i, 1, 'UTF-32')), 4, 8);
    }
    return implode("", $result);
}
// sample input from the example above
$text = '<li class="my_class">City - Mocsow (Москва)</li>';
$output = strToHex(
    str_replace( // this is for JSON compatibility
        array("\"", "\n", "\r", "\t", "/"),
        array('\"', '\n', "", " ", "\/"),
        $text
    )
);
echo $output;

Tested on PHP 5.2.17 :)

user989840