6

After quite a bit of searching and testing, the simplest method I've found for a Unicode-compatible alternative to the PHP ord() function is this:

$utf8Character = 'Ą';
list(, $ord) = unpack('N', mb_convert_encoding($utf8Character, 'UCS-4BE', 'UTF-8'));
echo $ord; # 260

I found this here. However, it has been mentioned that this method is rather slow. Does anyone know of a more efficient method which is nearly as simple? And what does UCS-4BE mean?

John Slegers
David Jones
  • That's actually... pretty damn simple. – Ignacio Vazquez-Abrams Jul 03 '12 at 04:46
  • Sorry, I wasn't clear. See updated post... – David Jones Jul 03 '12 at 04:49
  • Any other routine would have to do basically the same thing, since PHP isn't as strong at Unicode as other languages. – Ignacio Vazquez-Abrams Jul 03 '12 at 04:55
  • Okay, sounds good. I just hate not knowing what's going on. Like what is UCS-4BE and why is it so important to convert it to UCS-4BE? – David Jones Jul 03 '12 at 05:00
  • ASCII has a single (simple) number-to-character mapping. Unicode has [several](http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings#In_detail) (of which UTF-8 is only one). UCS-4BE is perhaps the one with the least amount of confusing quirks. – tylerl Jul 03 '12 at 05:10
  • http://www.joelonsoftware.com/articles/Unicode.html – Ignacio Vazquez-Abrams Jul 03 '12 at 05:13
  • Great resource @IgnacioVazquez-Abrams. Thanks! – David Jones Jul 03 '12 at 05:56
  • You should define what that "Unicode ord" is supposed to do. We know what ord does, but what result do you expect for, say, `mb_ord('漢')`? – deceze Jul 03 '12 at 06:36
  • It should simply return the Unicode point (as an integer) for the associated character. I tested this and it worked great: `list(, $ord) = unpack('N', mb_convert_encoding(mb_substr('漢', 0, 1, 'UTF-8'), 'UCS-4BE', 'UTF-8')); echo $ord.' ';`. This returns `28450` which is the correct code point: http://unicodelookup.com/#漢/1 – David Jones Jul 03 '12 at 07:04
  • The best Unicode alternative to `PHP ord()` is to look for another language (sorry, I couldn't resist). – leonbloy Jul 04 '12 at 01:54
  • @leonbloy: Haha. Yeah I'll be moving to Python soon, but I wasn't ready to throw away my PHP code... – David Jones Jul 05 '12 at 16:34

3 Answers

4

You might also be able to implement this function using iconv(), but the mb_convert_encoding method you've got looks reasonable to me. Just make sure that $utf8Character is a single character, not a long string, and it'll perform reasonably well.
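
As a rough sketch of the iconv() route mentioned above: the helper name iconv_ord is made up for illustration, and the code assumes the iconv extension is loaded and that the underlying iconv library recognises the charset name "UCS-4BE" (common, but not guaranteed on every platform).

```php
<?php
// Hypothetical helper (not part of PHP): code point of one UTF-8 character via iconv().
// Assumption: the local iconv library accepts the charset name "UCS-4BE".
function iconv_ord($utf8Character)
{
    $ucs4 = iconv('UTF-8', 'UCS-4BE', $utf8Character);
    if ($ucs4 === false || strlen($ucs4) < 4) {
        return false; // conversion failed or the input was empty
    }
    // 'N' reads the first four bytes as an unsigned 32-bit big-endian integer.
    list(, $ord) = unpack('N', $ucs4);
    return $ord;
}

echo iconv_ord('Ą'), "\n";
```

Like the mb_convert_encoding version, this only looks at the first character of whatever string it is given.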

UCS-4BE is a Unicode encoding which stores each character as a 32-bit (4-byte) integer. This accounts for the "UCS-4"; the "BE" suffix indicates that the integers are stored in big-endian order, most significant byte first. The reason for this encoding is that, unlike variable-width encodings (like UTF-8 or UTF-16), it needs no multi-byte sequences or surrogate pairs -- each character is a fixed size, so its four bytes can be unpacked directly as the code point.
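
To see what the conversion produces, here is a small sketch that dumps the raw bytes of the question's example character. It assumes the mbstring extension is loaded and the source file is saved as UTF-8; the UCS-4LE line is included only to contrast the two byte orders.

```php
<?php
$utf8Character = 'Ą'; // U+0104

// UTF-8 is variable width: this character takes two bytes.
echo bin2hex($utf8Character), "\n"; // c484

// UCS-4BE is fixed width: always four bytes, most significant byte first.
$ucs4be = mb_convert_encoding($utf8Character, 'UCS-4BE', 'UTF-8');
echo bin2hex($ucs4be), "\n"; // 00000104

// The little-endian variant stores the same four bytes in reverse order.
$ucs4le = mb_convert_encoding($utf8Character, 'UCS-4LE', 'UTF-8');
echo bin2hex($ucs4le), "\n"; // 04010000

// 'N' unpacks big-endian, 'V' unpacks little-endian; both give the code point.
list(, $ordBe) = unpack('N', $ucs4be);
list(, $ordLe) = unpack('V', $ucs4le);
echo $ordBe, ' ', $ordLe, "\n"; // 260 260
```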

  • Aha, that makes sense! So we grab all 4 bytes (instead of just one or two as with UTF-8 and UTF-16, respectively) and unpack it as an integer. Got it. Thanks! – David Jones Jul 03 '12 at 05:14
  • Right -- although I'll clarify that, with UTF-8, a single Unicode codepoint can require anywhere between one and four bytes; with UTF-16, that becomes two or four. –  Jul 03 '12 at 05:28
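
To make the one-to-four-byte point concrete, here is a rough sketch that decodes a single UTF-8 character by hand with ord() and bit masks, skipping the encoding conversion entirely. The name utf8_ord_manual is made up for illustration, the input is assumed to be well-formed UTF-8, and whether this is actually faster than mb_convert_encoding would need benchmarking.

```php
<?php
// Hypothetical helper: code point of the first character of a well-formed UTF-8 string.
// Assumption: no validation of continuation bytes is done here.
function utf8_ord_manual($char)
{
    if ($char === '') {
        return false;
    }
    $b0 = ord($char[0]);
    if ($b0 < 0x80) {                 // 0xxxxxxx: one byte (ASCII)
        return $b0;
    }
    if (($b0 & 0xE0) === 0xC0) {      // 110xxxxx: two bytes
        return (($b0 & 0x1F) << 6) | (ord($char[1]) & 0x3F);
    }
    if (($b0 & 0xF0) === 0xE0) {      // 1110xxxx: three bytes
        return (($b0 & 0x0F) << 12)
            | ((ord($char[1]) & 0x3F) << 6)
            | (ord($char[2]) & 0x3F);
    }
    if (($b0 & 0xF8) === 0xF0) {      // 11110xxx: four bytes
        return (($b0 & 0x07) << 18)
            | ((ord($char[1]) & 0x3F) << 12)
            | ((ord($char[2]) & 0x3F) << 6)
            | (ord($char[3]) & 0x3F);
    }
    return false;                     // not a valid UTF-8 lead byte
}
```

With this, utf8_ord_manual('Ą') gives 260 and utf8_ord_manual('漢') gives 28450, matching the values quoted in the question and its comments.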
4

I just wrote a polyfill for missing multibyte versions of ord and chr with the following in mind:

  • It defines functions mb_ord and mb_chr only if they don't already exist. If they do exist in your framework or some future version of PHP, the polyfill will be ignored.

  • It uses the widely used mbstring extension to do the conversion. If the mbstring extension is not loaded, it will use the iconv extension instead.

I also added functions for encoding / decoding HTML entities and for encoding / decoding to JSON format, as well as some demo code showing how to use these functions.


Code :

if (!function_exists('codepoint_encode')) {
    function codepoint_encode($str) {
        return substr(json_encode($str), 1, -1);
    }
}

if (!function_exists('codepoint_decode')) {
    function codepoint_decode($str) {
        return json_decode(sprintf('"%s"', $str));
    }
}

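if (!function_exists('mb_internal_encoding')) {
    function mb_internal_encoding($encoding = NULL) {
        // iconv_set_encoding() is deprecated since PHP 5.6; this fallback only runs when mbstring is missing.
        return ($encoding === NULL) ? iconv_get_encoding('internal_encoding') : iconv_set_encoding('internal_encoding', $encoding);
    }
}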

if (!function_exists('mb_convert_encoding')) {
    function mb_convert_encoding($str, $to_encoding, $from_encoding = NULL) {
        return iconv(($from_encoding === NULL) ? mb_internal_encoding() : $from_encoding, $to_encoding, $str);
    }
}

if (!function_exists('mb_chr')) {
    function mb_chr($ord, $encoding = 'UTF-8') {
        if ($encoding === 'UCS-4BE') {
            return pack("N", $ord);
        } else {
            return mb_convert_encoding(mb_chr($ord, 'UCS-4BE'), $encoding, 'UCS-4BE');
        }
    }
}

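if (!function_exists('mb_ord')) {
    function mb_ord($char, $encoding = 'UTF-8') {
        if ($encoding === 'UCS-4BE') {
            // A UCS-4BE character is 4 bytes ('N'); any other input is read as a 16-bit big-endian value ('n').
            list(, $ord) = (strlen($char) === 4) ? @unpack('N', $char) : @unpack('n', $char);
            return $ord;
        } else {
            // Convert to UCS-4BE first, then unpack the four bytes as the code point.
            return mb_ord(mb_convert_encoding($char, 'UCS-4BE', $encoding), 'UCS-4BE');
        }
    }
}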

if (!function_exists('mb_htmlentities')) {
    function mb_htmlentities($string, $hex = true, $encoding = 'UTF-8') {
        return preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) use ($hex) {
            return sprintf($hex ? '&#x%X;' : '&#%d;', mb_ord($match[0]));
        }, $string);
    }
}

if (!function_exists('mb_html_entity_decode')) {
    function mb_html_entity_decode($string, $flags = null, $encoding = 'UTF-8') {
        return html_entity_decode($string, ($flags === NULL) ? ENT_COMPAT | ENT_HTML401 : $flags, $encoding);
    }
}

How to use :

echo "Get string from numeric DEC value\n";
var_dump(mb_chr(50319, 'UCS-4BE'));
var_dump(mb_chr(271));

echo "\nGet string from numeric HEX value\n";
var_dump(mb_chr(0xC48F, 'UCS-4BE'));
var_dump(mb_chr(0x010F));

echo "\nGet numeric value of character as DEC int\n";
var_dump(mb_ord('ď', 'UCS-4BE'));
var_dump(mb_ord('ď'));

echo "\nGet numeric value of character as HEX string\n";
var_dump(dechex(mb_ord('ď', 'UCS-4BE')));
var_dump(dechex(mb_ord('ď')));

echo "\nEncode / decode to DEC based HTML entities\n";
var_dump(mb_htmlentities('tchüß', false));
var_dump(mb_html_entity_decode('tch&#252;&#223;'));

echo "\nEncode / decode to HEX based HTML entities\n";
var_dump(mb_htmlentities('tchüß'));
var_dump(mb_html_entity_decode('tch&#xFC;&#xDF;'));

echo "\nUse JSON encoding / decoding\n";
var_dump(codepoint_encode("tchüß"));
var_dump(codepoint_decode('tch\u00fc\u00df'));

Output :

Get string from numeric DEC value
string(4) "ď"
string(2) "ď"

Get string from numeric HEX value
string(4) "ď"
string(2) "ď"

Get numeric value of character as DEC int
int(50319)
int(271)

Get numeric value of character as HEX string
string(4) "c48f"
string(3) "10f"

Encode / decode to DEC based HTML entities
string(15) "tch&#252;&#223;"
string(7) "tchüß"

Encode / decode to HEX based HTML entities
string(15) "tch&#xFC;&#xDF;"
string(7) "tchüß"

Use JSON encoding / decoding
string(15) "tch\u00fc\u00df"
string(7) "tchüß"
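
For comparison, PHP 7.2 and later ship mb_ord() and mb_chr() as part of mbstring, so on those versions the function_exists() guards above leave the native functions in place. A minimal round trip, assuming PHP 7.2+ with mbstring (or the polyfill above loaded) and a source file saved as UTF-8:

```php
<?php
// Code point of U+010F (ď), then back to the UTF-8 string.
$codePoint = mb_ord('ď', 'UTF-8');
$character = mb_chr($codePoint, 'UTF-8');

var_dump($codePoint);         // int(271)
var_dump($character === 'ď'); // bool(true)
```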
John Slegers
0

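Here's my string-to-int conversion built on the same unpack / mb_convert_encoding approach: it adds up the code points of every character in the string. You could also split the string into characters and use array_reduce to sum them.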

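/**
 * Sums the Unicode code points of every character in a UTF-8 string.
 *
 * @param string $string      UTF-8 encoded input
 * @param int    $index       character (not byte) offset to start from
 * @param int    $carryResult running total from earlier characters
 * @return int
 */
function convertEncoding($string, $index = 0, $carryResult = 0)
{
    // Stop once the character offset runs past the end of the string.
    if ($index >= mb_strlen($string, 'UTF-8')) {
        return $carryResult;
    }
    // mb_substr() picks a whole character; $string[$index] would pick a single byte.
    $currentCharacter = mb_substr($string, $index, 1, 'UTF-8');
    list(, $ord) = unpack('N', mb_convert_encoding($currentCharacter, 'UCS-4BE', 'UTF-8'));
    // Plain function, so recurse without $this.
    return convertEncoding($string, $index + 1, $carryResult + $ord);
}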
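
The array_reduce variant mentioned above could look roughly like this. The function name sumCodePoints is made up for illustration, and splitting with preg_split and the /u modifier is one assumed way to break a UTF-8 string into characters (a plain explode would not split on character boundaries).

```php
<?php
// Hypothetical helper: sum of the code points of all characters in a UTF-8 string.
function sumCodePoints($string)
{
    // Split into individual characters; the /u modifier makes PCRE treat the input as UTF-8.
    $characters = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
    if ($characters === false) {
        return 0; // assumption: treat invalid UTF-8 as contributing nothing
    }
    return array_reduce($characters, function ($carry, $character) {
        list(, $ord) = unpack('N', mb_convert_encoding($character, 'UCS-4BE', 'UTF-8'));
        return $carry + $ord;
    }, 0);
}

echo sumCodePoints('Ąď'), "\n"; // 260 + 271 = 531
```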
Michael Ryan Soileau