The PHP library lacks an mb_ord() function, that is, something that does what the ord() function does, but for UTF-8 (or "mb", multibyte, hence "mb_ord"). I used some clues from here,

 $ord = hexdec( bin2hex($utf8char) ); //decimal 

and I suppose that mb_substr($text, $i, 1, 'UTF-8') gets one UTF-8 character... But $ord does not return the values we expect.

CONTEXT

This code does not work: it does not show codes like 177 (plusmn).

 $msg = '';
 $text = "... a UTF-8 long text... Ą ⨌ 2.5±0.1; 0.5±0.2 ...";
 $allOrds = array(); 
 for($i=0; $i<mb_strlen($text, 'UTF-8'); $i++) {
    $utf8char = mb_substr($text, $i, 1,  'UTF-8'); // 1=1 unicode character?
    $ord = hexdec( bin2hex($utf8char) ); //decimal 
    if ($ord>126) { //non-ASCII
      if (isset($allOrds[$ord])) $allOrds[$ord]++; else $allOrds[$ord]=1;
    }
 }
 foreach($allOrds as $o=>$n)
    $msg.="\n entity #$o occurs $n times";
 echo $msg;

OUTPUT

entity #50308 occurs 1 times
entity #14854284 occurs 1 times
entity #49841 occurs 2 times

So (see the entities table), 49841 is not 177 (plusmn), and 14854284 is not 10764 (iiiint).

Peter Krauss

1 Answer

"something that does what the ord() function does, but for UTF-8"

For that, you'd first need to define exactly what that is. ord() gives you the numerical value of a byte. This is often mistaken for the "value of the character", but a character only has a numeric value relative to a particular encoding, so that notion makes no sense on its own. So: ord == numerical value of a byte. What exactly would you expect the "MB version of ord" to do?

Anyway, what you're getting is the numeric value of two (or more) bytes taken together. Say, the character "漢" in UTF-8 is encoded as the three bytes E6 BC A2. That's what bin2hex gives you. hexdec then translates that to decimal, which is a pretty large number (15121570). That number has absolutely nothing to do with the Unicode code point U+6F22 (decimal 28450), which is what you're really after. That is because UTF-8 does not store the code point's bits verbatim: it spreads them over several bytes and adds marker bits to each byte, hence U+6F22 (漢) does not translate into the bytes 6F 22.
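To make the difference concrete, here is a minimal sketch that puts the "bytes read as one integer" value next to the code point decoded by hand. It assumes the source file is saved as UTF-8; the variable names are only illustrative, and the bit masks cover the three-byte UTF-8 form only.

```php
<?php
$char  = "漢";               // U+6F22, assuming this file is saved as UTF-8
$hex   = bin2hex($char);     // "e6bca2": the three raw UTF-8 bytes
$asInt = hexdec($hex);       // the three bytes read as a single integer

// Decode by hand for the three-byte form 1110xxxx 10xxxxxx 10xxxxxx:
// mask off the marker bits and glue the payload bits back together.
$b = array_values(unpack('C*', $char));
$codePoint = (($b[0] & 0x0F) << 12) | (($b[1] & 0x3F) << 6) | ($b[2] & 0x3F);

printf("bytes=%s asInt=%d codePoint=%d (U+%04X)\n", $hex, $asInt, $codePoint, $codePoint);
// prints: bytes=e6bca2 asInt=15121570 codePoint=28450 (U+6F22)
```

The same arithmetic explains the output in the question: "±" is U+00B1 (177), stored in UTF-8 as C2 B1, and 0xC2B1 is 49841.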

You have already linked to another question which does what you want:

list(, $ord) = unpack('N', mb_convert_encoding($utf8Character, 'UCS-4BE', 'UTF-8'));

This essentially uses the same logic, but bases it on the UCS-4 encoding, in which every code point is stored as a fixed four-byte big-endian integer, so the bytes read as a number are the code point itself.
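As a sketch of how that one-liner could be dropped into the loop from the question: the helper name utf8_ord() below is hypothetical (it is not a PHP built-in), the mbstring extension is assumed to be loaded, and the sample text is a shortened stand-in for the original string. On PHP 7.2 and later a built-in mb_ord() is available and could be used instead.

```php
<?php
// Hypothetical helper, not part of PHP: returns the Unicode code point
// of the first character of a non-empty UTF-8 string.
function utf8_ord($utf8Character)
{
    // UCS-4BE stores each code point as a 4-byte big-endian integer,
    // which unpack('N') reads back as a PHP integer.
    $ucs4 = mb_convert_encoding($utf8Character, 'UCS-4BE', 'UTF-8');
    list(, $ord) = unpack('N', $ucs4);
    return $ord;
}

$text = "2.5±0.1; 0.5±0.2 Ą ⨌";   // assumes this file is saved as UTF-8
$allOrds = array();
$length = mb_strlen($text, 'UTF-8');
for ($i = 0; $i < $length; $i++) {
    $ord = utf8_ord(mb_substr($text, $i, 1, 'UTF-8'));
    if ($ord > 127) {             // above the 7-bit ASCII range
        $allOrds[$ord] = isset($allOrds[$ord]) ? $allOrds[$ord] + 1 : 1;
    }
}
foreach ($allOrds as $o => $n) {
    echo "entity #$o occurs $n times\n";
}
```

With this sample text the counts come out as 177 twice, and 260 and 10764 once each, which are the decimal code points of "±", "Ą" and "⨌".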

deceze
  • An `mb_ord($c)` would return the "numerical value of a multibyte (UTF) symbol", and I think its absence is a gap in PHP. – Peter Krauss Oct 09 '13 at 12:28
  • What does "numerical value of multibyte" mean? Do you want a hypothetical `mb_ord` to return you the *Unicode code point*? What about `mb_ord($str)` where `$str` is Shift-JIS encoded (or some other non-Unicode encoding)? – deceze Oct 09 '13 at 12:30
  • Thanks, your solution works fine! I used [the same](http://stackoverflow.com/a/10333307/287948) in my first implementation, but it did not work in my loop... that was my bug. – Peter Krauss Oct 09 '13 at 12:30
  • About the semantic discussion: perhaps a better, more precise and formal term for the "numerical value of a multibyte symbol" is the W3C "numerical entity" definition... The entity was encoded into the UTF-8 character, so the translation is from the "character-encoded entity" (see the [table](http://dev.w3.org/html5/html-author/charref) for examples of UTF-8 decimal codes) to its decimal value... From this point of view it is obvious and intuitive. – Peter Krauss Oct 09 '13 at 12:36
  • And that numeric value is based on the Unicode table. While I certainly agree it would be helpful to have a simple built-in function to get the Unicode codepoint of any particular character, this is just as arbitrary a classification as anything else. – deceze Oct 09 '13 at 12:40