UTF-8 safe equivalent of ord or charCodeAt() in PHP

Question

I need to be able to use ord() to get the same value as JavaScript's charCodeAt() function. The problem is that ord() doesn't support UTF-8.

How can I get Ą to translate to 260 in PHP? I've tried some uniord functions out there, but they all report 256 instead of 260.

Thanks a lot for any help!

Regards

@bardiir Yeah I realised that moments after posting. – alex Apr 26 '12 at 12:13
Sorry, should have been more clear. PHP – Rila Apr 26 '12 at 12:14

masakielastic · Answer 1 · 2013-08-30T20:10:29.573

mbstring version:

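function utf8_char_code_at($str, $index)
{
    $char = mb_substr($str, $index, 1, 'UTF-8');

    // An out-of-range index yields '', which would otherwise decode to 0.
    if ($char === '' || !mb_check_encoding($char, 'UTF-8')) {
        return null;
    }

    // UTF-32BE holds the code point as a 4-byte big-endian integer.
    $ret = mb_convert_encoding($char, 'UTF-32BE', 'UTF-8');
    return hexdec(bin2hex($ret));
}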

Version using htmlspecialchars and htmlspecialchars_decode to extract one character:

function utf8_char_code_at($str, $index)
{
    $char = '';
    $str_index = 0;

    $str = utf8_scrub($str);
    $len = strlen($str);

    for ($i = 0; $i < $len; $i += 1) {

        $char .= $str[$i];

        if (utf8_check_encoding($char)) {

            if ($str_index === $index) {
                return utf8_ord($char);
            }

            $char = '';
            $str_index += 1;
        }
    }

    return null;
}

function utf8_scrub($str)
{
    return htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8'));
}

function utf8_check_encoding($str)
{
    return $str === utf8_scrub($str);
}

function utf8_ord($char)
{
    $lead = ord($char[0]);

    if ($lead < 0x80) {
        return $lead;
    } else if ($lead < 0xE0) {
        return (($lead & 0x1F) << 6)
            | (ord($char[1]) & 0x3F);
    } else if ($lead < 0xF0) {
        return (($lead & 0xF) << 12)
            | ((ord($char[1]) & 0x3F) << 6)
            | (ord($char[2]) & 0x3F);
    } else {
        return (($lead & 0x7) << 18)
            | ((ord($char[1]) & 0x3F) << 12)
            | ((ord($char[2]) & 0x3F) << 6)
            | (ord($char[3]) & 0x3F);
    }
}

PHP extension version:

#include "ext/standard/html.h"
#include "ext/standard/php_smart_str.h"

const zend_function_entry utf8_string_functions[] = {
    PHP_FE(utf8_char_code_at, NULL)
    PHP_FE_END
};

PHP_FUNCTION(utf8_char_code_at)
{
    char *str;
    int len;
    long index;

    unsigned int code_point;
    long i;
    int status;
    size_t pos = 0, old_pos = 0;

    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "sl", &str, &len, &index) == FAILURE) {
        return;
    }

    for (i = 0; pos < len; ++i) {
        old_pos = pos;
        code_point = php_next_utf8_char((const unsigned char *) str, (size_t) len, &pos, &status);

        if (i == index) {
            if (status == SUCCESS) {
                RETURN_LONG(code_point);
            } else {
                RETURN_NULL();
            }

        }

    }

    RETURN_NULL();
}

Wow. That is just like ***insanely*** complicated for something that should be a trivial built-in in the language proper! I’ll give you a +1 for effort, but wow, just wow! — tchrist, Aug 28 '13 at 22:29
Thanks. I added another example using htmlspecialchars and htmlspecialchars_decode. I posted this as an exercise in reading the PHP source code and practicing C. I am considering proposing a new string function for mbstring or PHP core. This function corresponds to Ruby's each_char. It can be used for defining fallback functions such as mb_strlen and mb_substr. I implemented this function as a PHP extension: http://blog.sarabande.jp/post/57645700697 (sorry, the article is in Japanese). — masakielastic, Aug 30 '13 at 15:40

hakre · Accepted Answer · 2012-04-26T13:50:29.463

11

ord() works byte by byte (as do most of PHP's standard string functions, if not all). You would need to do the conversion yourself, for example with the help of the multibyte string extension:

$utf8Character = 'Ą';
list(, $ord) = unpack('N', mb_convert_encoding($utf8Character, 'UCS-4BE', 'UTF-8'));
echo $ord; # 260
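
To make the byte-wise behaviour concrete, here is a minimal sketch (not part of the original answer). It writes Ą as its two UTF-8 bytes explicitly, so the result does not depend on how the source file is saved, and it uses only core functions. The variable names are illustrative.

```php
<?php
// 'Ą' (U+0104) encoded as UTF-8 is the two bytes 0xC4 0x84.
$char = "\xC4\x84";

var_dump(strlen($char));  // int(2): strlen() counts bytes, not characters
var_dump(ord($char));     // int(196): ord() only looks at the first byte (0xC4)
var_dump(ord($char[1]));  // int(132): the continuation byte (0x84)

// Decoding the two-byte sequence by hand gives the code point:
// 5 payload bits from the lead byte, 6 from the continuation byte.
$codePoint = ((ord($char[0]) & 0x1F) << 6) | (ord($char[1]) & 0x3F);
var_dump($codePoint);     // int(260)
```

The lead byte alone contributes 256 here and the continuation byte the remaining 4, so a decoder that drops the continuation bits would plausibly report 256.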

edited Apr 26 '12 at 13:50

answered Apr 26 '12 at 12:22

hakre


`list` isn't a function, but a special form; `list($ord) = $someArray` is basically the same thing as `$ord = $someArray[0]`. `list` is handy when you want to assign the elements of an array to multiple variables, or to get around the fact that you can't add a subscript to an array expression that's not an actual array variable in PHP < 5.4. – Mark Reed Apr 26 '12 at 12:30
Ah, I see. But when I'm executing the code above it's not outputting anything (it's blank). Any ideas how to turn this into a home run? – Rila Apr 26 '12 at 12:32
Not sure why it's not working with `list`, but try this: `$chars = unpack('N', mb_convert_encoding($utf8Character, 'UCS-4BE', 'UTF-8')); $ord = $chars[0];` – Mark Reed Apr 26 '12 at 12:34
Hmm, not working either. Looks like both unpack and list are setting the value in index 1 of the array rather than 0. Will this change depending on the number of bytes a character takes up or will this always be reliably 1? – Rila Apr 26 '12 at 12:40
mb_convert_encoding translates from UTF-8 to UCS-4BE, which gets you a 4-byte big-endian integer representation of the character code. The 'N' format causes unpack to parse a big-endian ("network format") integer and turn it into a regular PHP number. There shouldn't be any extra stuff in the result of the unpack. What if you just `print_r` the results of the unpack? Or the string 'Ą', for that matter - make sure you didn't insert an extra control char or space or something? – Mark Reed Apr 26 '12 at 12:48
print_r is giving me Array ( [1] => 260 ) which is rather strange that it's not zero indexed, but as long as it works I'm happy :) Thanks! – Rila Apr 26 '12 at 12:52
let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/10549/discussion-between-mark-reed-and-rila) – Mark Reed Apr 26 '12 at 13:02
@Rila: That was an error when I edited the answer, was AFK in the while, corrected it now. – hakre Apr 26 '12 at 13:50
@hakre: could also use array_merge() to compress the result of unpack(). I find it incomprehensible that unpack returns a 1-based array. – Mark Reed Apr 26 '12 at 13:54
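
The 1-based result of unpack() discussed in these comments can be reproduced without mbstring. The following is a small sketch under that assumption: it builds the four big-endian bytes with pack() instead of mb_convert_encoding(), and the element name `code` is an arbitrary label chosen for illustration.

```php
<?php
// Four big-endian bytes for 260, the same bytes 'Ą' has in UCS-4BE.
$packed = pack('N', 260);

// Unnamed elements are numbered starting at 1, not 0.
$result = unpack('N', $packed);
print_r($result);                    // Array ( [1] => 260 )

// Option 1: read index 1 directly.
$ord = $result[1];

// Option 2: skip index 0 in list(), as the accepted answer does.
list(, $viaList) = unpack('N', $packed);

// Option 3: name the element in the format string.
$named = unpack('Ncode', $packed);   // array('code' => 260)

echo $ord, ' ', $viaList, ' ', $named['code'], "\n";
```

Naming the element avoids relying on the numeric index at all, which may be easier to read than the `list(, $ord)` form.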

score 4 · Answer 3 · answered Apr 26 '12 at 12:23

Try:


function uniord($c) {
    // The first byte determines how long the UTF-8 sequence is.
    $h = ord($c[0]);
    if ($h <= 0x7F) {
        return $h;                      // 1 byte (ASCII)
    } else if ($h < 0xC2) {
        return false;                   // continuation byte or overlong lead
    } else if ($h <= 0xDF) {
        return ($h & 0x1F) << 6 | (ord($c[1]) & 0x3F);
    } else if ($h <= 0xEF) {
        return ($h & 0x0F) << 12 | (ord($c[1]) & 0x3F) << 6
                                 | (ord($c[2]) & 0x3F);
    } else if ($h <= 0xF4) {
        return ($h & 0x07) << 18 | (ord($c[1]) & 0x3F) << 12
                                 | (ord($c[2]) & 0x3F) << 6
                                 | (ord($c[3]) & 0x3F);
    } else {
        return false;                   // not a valid UTF-8 lead byte
    }
}

echo uniord('Ą'); // 260 when the source file is saved as UTF-8

answered Apr 26 '12 at 12:23

Sudhir Bastakoti


Thanks Sudhir, that works! What's the source of this function? – Rila Apr 26 '12 at 12:27
Well, actually I also got it from some source that I don't currently remember, as I have had this code for a long time. Sorry for that, but I hope the function will help you solve the problem – Sudhir Bastakoti Apr 26 '12 at 12:31
Implementing UTF-8 by hand is fun and all; I've done it a few times. But I think it's smarter to use libraries maintained by someone else. Especially since then you can handle other encodings as well.. – Mark Reed Apr 26 '12 at 12:36

score 0 · Answer 4 · edited Nov 16 '20 at 16:03

This should be the equivalent of JavaScript’s charCodeAt(), based on @hakre’s work but corrected to actually work the same as JavaScript (in every way I could think of to test):

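function charCodeAt($string, $offset) {
  // State the encodings explicitly instead of relying on mbstring's internal encoding.
  $string = mb_substr($string, $offset, 1, 'UTF-8');
  // 'v' is always a little-endian 16-bit value, matching UTF-16LE; 'S' uses machine byte order.
  list(, $ret) = unpack('v', mb_convert_encoding($string, 'UTF-16LE', 'UTF-8'));
  return $ret;
}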

(This requires the PHP extension "mbstring" to be installed and activated.)

edited Nov 16 '20 at 16:03

Mathias Brodala


answered Jun 20 '16 at 01:18

Daniel


Change $character by $string :-) – Marcos Fernandez Ramos Mar 03 '18 at 11:39

score 0 · Answer 5 · answered Mar 26 '23 at 17:18

Since PHP 7.2 there is mb_ord(). Using it, one can get a JS equivalent of charCodeAt() as

function jsCharCodeAt($string, $index)
{
     return mb_ord(mb_substr($string, $index, 1));
}

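This seems to work just fine for all characters that fit in a single UTF-16 code unit (the Basic Multilingual Plane). However, charCodeAt() behaves differently for characters outside that range: it indexes UTF-16 code units and returns surrogate values, so the two functions are not equivalent there.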
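
For cases where the surrogate behaviour matters, one possible approach is to index UTF-16 code units directly, as JavaScript does. The sketch below is not from any of the answers: `utf16CodeUnitAt` is a hypothetical helper name, it assumes the input is valid UTF-8 and that mbstring is available, and it returns null where JavaScript would return NaN.

```php
<?php
// Hypothetical helper (illustrative name): returns the UTF-16 code unit at
// $index, counting code units the way JavaScript's charCodeAt() does.
function utf16CodeUnitAt(string $string, int $index): ?int
{
    // Every UTF-16 code unit is exactly two bytes in UTF-16BE.
    $utf16 = mb_convert_encoding($string, 'UTF-16BE', 'UTF-8');
    $offset = $index * 2;

    if ($index < 0 || $offset + 2 > strlen($utf16)) {
        return null; // out of range; JavaScript would give NaN
    }

    // 'n' reads an unsigned 16-bit big-endian integer; unpack() is 1-based.
    $unit = unpack('n', substr($utf16, $offset, 2));
    return $unit[1];
}

$text = "a\xF0\x9F\x98\x80"; // 'a' followed by U+1F600 as UTF-8 bytes

var_dump(utf16CodeUnitAt("\xC4\x84", 0)); // int(260)
var_dump(utf16CodeUnitAt($text, 1));      // int(55357), high surrogate 0xD83D
var_dump(utf16CodeUnitAt($text, 2));      // int(56832), low surrogate 0xDE00
var_dump(utf16CodeUnitAt($text, 3));      // NULL
```

Converting the whole string on every call is wasteful for long inputs; a caller that needs many lookups would probably convert once and reuse the UTF-16 string.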

score -1 · Answer 6 · edited May 23 '17 at 11:53

There is an ord_utf8 function here: https://stackoverflow.com/a/42600959/7558876

The function looks like this (it accepts a string and returns an integer):

<?php

function ord_utf8($s){
return (int) ($s=unpack('C*',$s[0].$s[1].$s[2].$s[3]))&&$s[1]<(1<<7)?$s[1]:
($s[1]>239&&$s[2]>127&&$s[3]>127&&$s[4]>127?(7&$s[1])<<18|(63&$s[2])<<12|(63&$s[3])<<6|63&$s[4]:
($s[1]>223&&$s[2]>127&&$s[3]>127?(15&$s[1])<<12|(63&$s[2])<<6|63&$s[3]:
($s[1]>193&&$s[2]>127?(31&$s[1])<<6|63&$s[2]:0)));
}

And a fast chr_utf8 here: https://stackoverflow.com/a/42510129/7558876

This function looks like this (it accepts an integer and returns a string):

<?php

function chr_utf8($n,$f='C*'){
return $n<(1<<7)?chr($n):($n<1<<11?pack($f,192|$n>>6,1<<7|191&$n):
($n<(1<<16)?pack($f,224|$n>>12,1<<7|63&$n>>6,1<<7|63&$n):
($n<(1<<20|1<<16)?pack($f,240|$n>>18,1<<7|63&$n>>12,1<<7|63&$n>>6,1<<7|63&$n):'')));
}

Please check the links if you want an example.

edited May 23 '17 at 11:53

Community


answered Mar 04 '17 at 20:43

Php'Regex


A link to a solution is welcome, but please ensure your answer is useful without it: [add context around the link](//meta.stackexchange.com/a/8259) so your fellow users will have some idea what it is and why it’s there, then quote the most relevant part of the page you're linking to in case the target page is unavailable. [Answers that are little more than a link may be deleted.](//stackoverflow.com/help/deleted-answers) – FelixSFD Mar 04 '17 at 20:44
BTW: If you think the question has an answer somewhere else on Stack Overflow, please mark it as [duplicate](http://stackoverflow.com/help/duplicates) instead of quoting the other answer. – FelixSFD Mar 04 '17 at 20:44
