33

Is it possible to input a character and get its Unicode value back? For example, I can put &#12103; in HTML to output "⽇". Is it possible to pass that character as an argument to a function and get the number as output, without building a Unicode table?

$val = someFunction("⽇");//returns 12103

or the reverse?

$val2 = someOtherFunction(12103);//returns "⽇"

I would like to be able to output the actual characters to the page, not the codes, and I would also like to be able to get the code from the character if possible. The closest I have got to what I want is php.net/manual/en/function.mb-decode-numericentity.php, but I can't get it working. Is this the function I need, or am I on the wrong track?

Totoro

5 Answers

39
function _uniord($c) {
    if (ord($c[0]) >=0 && ord($c[0]) <= 127)
        return ord($c[0]);
    if (ord($c[0]) >= 192 && ord($c[0]) <= 223)
        return (ord($c[0])-192)*64 + (ord($c[1])-128);
    if (ord($c[0]) >= 224 && ord($c[0]) <= 239)
        return (ord($c[0])-224)*4096 + (ord($c[1])-128)*64 + (ord($c[2])-128);
    if (ord($c[0]) >= 240 && ord($c[0]) <= 247)
        return (ord($c[0])-240)*262144 + (ord($c[1])-128)*4096 + (ord($c[2])-128)*64 + (ord($c[3])-128);
    if (ord($c[0]) >= 248 && ord($c[0]) <= 251)
        return (ord($c[0])-248)*16777216 + (ord($c[1])-128)*262144 + (ord($c[2])-128)*4096 + (ord($c[3])-128)*64 + (ord($c[4])-128);
    if (ord($c[0]) >= 252 && ord($c[0]) <= 253)
        return (ord($c[0])-252)*1073741824 + (ord($c[1])-128)*16777216 + (ord($c[2])-128)*262144 + (ord($c[3])-128)*4096 + (ord($c[4])-128)*64 + (ord($c[5])-128);
    if (ord($c[0]) >= 254 && ord($c[0]) <= 255)    //  error
        return FALSE;
    return 0;
}   //  function _uniord()

and

function _unichr($o) {
    if (function_exists('mb_convert_encoding')) {
        return mb_convert_encoding('&#'.intval($o).';', 'UTF-8', 'HTML-ENTITIES');
    } else {
        return chr(intval($o));
    }
}   // function _unichr()
gturri
Mark Baker
  • Hi Mark, Thanks for the code. Is this from somewhere online with an explanation on how it works? – Totoro Feb 20 '12 at 13:18
  • It's code I use in PHPExcel; but I can't recall where I got it from now, or find a reference to its source... but it's used in a number of libraries – Mark Baker Feb 20 '12 at 13:31
  • The first function takes a string (a Unicode character consists of several octets) and checks the first bits of the first octet to find out the length of the character in octets (I think it's using UTF-8). It then strips the control bits from every octet and turns the remaining bits (those forming the Unicode character itself) into the number you want. That conversion is straightforward, just turning the integer to string. – Sebastián Grignoli Feb 20 '12 at 13:36
  • You are a lifesaver!! Thank you! – Sangar82 Apr 10 '18 at 07:48
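
As a hedged illustration of the bit stripping described in the comments above, here is a minimal sketch that decodes the three UTF-8 bytes of "⽇" by hand. It assumes valid UTF-8 input and covers only the three-byte case; the variable names are illustrative and not taken from PHPExcel.

```php
<?php
// "⽇" (U+2F47) is encoded in UTF-8 as the three bytes E2 BD 87.
$bytes = array_map('ord', str_split("\xE2\xBD\x87"));

// Lead byte 1110xxxx: keep the low 4 bits.
// Continuation bytes 10xxxxxx: keep the low 6 bits of each.
$codePoint = (($bytes[0] & 0x0F) << 12)
           | (($bytes[1] & 0x3F) << 6)
           |  ($bytes[2] & 0x3F);

echo $codePoint, "\n"; // 12103
```

This is the same arithmetic as the 224..239 branch of _uniord(): subtracting 224 or 128 clears the marker bits, and multiplying by 4096 or 64 shifts by 12 or 6 bits.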
26

Here's a more compact implementation of unichr/uniord based on pack:

// code point to UTF-8 string
function unichr($i) {
    return iconv('UCS-4LE', 'UTF-8', pack('V', $i));
}

// UTF-8 string to code point
function uniord($s) {
    return unpack('V', iconv('UTF-8', 'UCS-4LE', $s))[1];
}
bobince
20

If you're using PHP 7.2 (or later), you don't need to define a new function. The Multibyte String extension provides two functions for exactly this purpose.

To get the code point of a character (i.e. its Unicode value), use mb_ord(); to get the character for a given code point, use mb_chr().

E.g.:

mb_chr(12103, "utf8"); // ⽇
mb_ord("⽇", "utf8"); // 12103
MAChitgarha
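
A small round-trip sketch of the two calls above, assuming the mbstring extension is loaded, PHP is 7.2 or later, and the source file is saved as UTF-8. The variable names are illustrative.

```php
<?php
$char = "⽇";

// Code point of the first character in the string.
$codePoint = mb_ord($char, 'UTF-8');

// Character for that code point.
$back = mb_chr($codePoint, 'UTF-8');

// Show the decimal value and the usual U+XXXX notation.
printf("%s -> %d (U+%04X) -> %s\n", $char, $codePoint, $codePoint, $back);
```

With the inputs above this prints "⽇ -> 12103 (U+2F47) -> ⽇". Passing the encoding explicitly avoids depending on mb_internal_encoding().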
10

This also works (for someone who understands bit shifting, it might be more readable than Mark Baker's answer):

public function ordinal($str){
    $charString = mb_substr($str, 0, 1, 'utf-8');
    $size = strlen($charString);        
    $ordinal = ord($charString[0]) & (0xFF >> $size);
    //Merge other characters into the value
    for($i = 1; $i < $size; $i++){
        $ordinal = $ordinal << 6 | (ord($charString[$i]) & 127);
    }
    return $ordinal;
}
user23127
  • Hello, I tested your answer against Mark's and I think there is an issue with yours (I am not good with bit shifting, so I don't know what it is). echo ordinal("響")." :: "._uniord("響"); returns: 105 :: 38911 (it should be 38911). – Totoro May 05 '14 at 09:34
  • Hello, thank you for the response. The error seems to come from the default encoding, mb_internal_encoding(): if that is not 'utf-8', retrieving the first character fails. I have fixed this by explicitly passing the encoding to mb_substr. – user23127 May 05 '14 at 10:18
  • I upvoted as it works now, but will leave the answer as it was. Thanks for the alternative. – Totoro May 05 '14 at 14:35
  • Sure, I don't really answer for karma :P. – user23127 May 05 '14 at 14:42
3

You can use the following functions

For encoding

string utf8_encode ( string $data )

http://php.net/manual/en/function.utf8-encode.php

For decoding

string utf8_decode ( string $data )

http://php.net/manual/en/function.utf8-decode.php

Also check

http://php.net/manual/en/function.htmlspecialchars.php

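<?php
// htmlspecialchars_decode() leaves numeric entities untouched,
// so this echoes the entity as-is and the browser renders it as ⽇.
echo htmlspecialchars_decode("&#12103;");
?>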
Akhil Thayyil
  • Hello Akhil, I have looked at these, but they only work with ASCII-range characters; anything above that becomes gibberish. – Totoro Feb 20 '12 at 13:08
  • Hello @Akhil, thanks, this works. Shame there is no encode option. – Totoro Feb 20 '12 at 13:58
  • UTF-8 is a Unicode encoding, not Unicode. utf8_decode does not give me the unicode value of the character I pass it (what the question asked for). The question asked about `12103` specifically, where `utf8_encode` and `utf8_decode` both return the same number(/string) that it was passed instead of a unicode character. – Kissaki Jan 15 '16 at 21:11
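
For the entity round trip raised in the question and in the comments above, here is a hedged sketch using mb_encode_numericentity() and mb_decode_numericentity() from the mbstring extension. The conversion map is an assumption chosen for illustration: it covers every code point above ASCII, so plain ASCII text passes through unchanged. It also assumes the source file is saved as UTF-8.

```php
<?php
// Conversion map: [first code point, last code point, offset, mask].
// Assumed range: everything from U+0080 up to U+10FFFF.
$map = [0x80, 0x10FFFF, 0, 0x1FFFFF];

// Character -> decimal numeric entity.
$entity = mb_encode_numericentity("⽇", $map, 'UTF-8');

// Decimal numeric entity -> character.
$char = mb_decode_numericentity('&#12103;', $map, 'UTF-8');

echo $entity, "\n"; // &#12103;
echo $char, "\n";   // ⽇
```

The number inside the entity is the code point itself, so stripping the "&#" and ";" from $entity yields 12103.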