How can I deal with Unicode code points?

Question

Let's say a user submits a comment and I want to obtain the array of Unicode code points of its value, decide which code points are invalid, discard them, and save the comment. How can I do that?

e.g.

The user submits "hello", and I want to obtain an array $codepoints with the following values:

$codepoints[0] = 0068
$codepoints[1] = 0065
$codepoints[2] = 006C
$codepoints[3] = 006C
$codepoints[4] = 006F

And, for some strange reason, I don't want to allow the letter "l", so I want to discard the characters with the code point U+006C. So the saved comment would be "heo". Is this even possible?

Thanks in advance!

See http://stackoverflow.com/questions/395832/how-to-get-code-point-number-for-a-given-character-in-a-utf-8-string — JW., Aug 24 '11 at 03:17
Better to use mb_convert_encoding if you have the mbstring extension installed. Code points are what you get after decoding UTF-8 or UTF-16LE/BE. A code point fits in 21 bits (the range is U+0000 to U+10FFFF), but most systems use 32-bit integers to represent one, for speed. A character can consist of one or more code points, depending on the combining marks on the glyph. — Rahly, Aug 24 '11 at 04:42
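
As a rough illustration of the mb_convert_encoding route mentioned in the comment above, here is a minimal sketch. It assumes the mbstring extension is loaded and that the input is valid UTF-8. The helper names utf8_to_codepoints and codepoints_to_utf8 and the $banned list are made up for this example; they are not part of PHP or of the thread.

```php
<?php
// Hypothetical helpers for illustration; assumes ext-mbstring and UTF-8 input.

/** Decode a UTF-8 string into a 0-based array of integer code points. */
function utf8_to_codepoints(string $text): array
{
    if ($text === '') {
        return [];
    }
    // UTF-32BE stores each code point as one 4-byte big-endian integer.
    $utf32 = mb_convert_encoding($text, 'UTF-32BE', 'UTF-8');
    // unpack() returns a 1-based array, so reindex from 0.
    return array_values(unpack('N*', $utf32));
}

/** Encode an array of integer code points back into a UTF-8 string. */
function codepoints_to_utf8(array $codepoints): string
{
    if ($codepoints === []) {
        return '';
    }
    $utf32 = pack('N*', ...$codepoints);
    return mb_convert_encoding($utf32, 'UTF-8', 'UTF-32BE');
}

$codepoints = utf8_to_codepoints('hello');

// Code points to discard; U+006C is the letter "l" from the question.
$banned = [0x006C];

$kept = array_values(array_filter(
    $codepoints,
    static function (int $cp) use ($banned): bool {
        return !in_array($cp, $banned, true);
    }
));

echo codepoints_to_utf8($kept), "\n"; // heo
```

The array holds plain integers; something like sprintf('%04X', $cp) would give the four-digit hex form used in the question (0068, 0065, and so on).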

score 2 · Answer 1 · edited May 23 '17 at 12:04

Here's an example with Unicode literals.

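// Treat strings and patterns as UTF-8 so that "•" is matched as one character.
mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');
// Remove every bullet (U+2022) from the input string.
echo mb_ereg_replace('[•]', '', '•T•e•s•t•');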

This will output the string Test.

If you'd rather write the code points in hex, this answer may be useful.
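
For the hex form, one possible sketch uses preg_replace with the /u modifier, which makes PCRE treat both pattern and subject as UTF-8 so that \x{...} matches a single code point. The variable names are illustrative only, and this is not necessarily the approach the linked answer takes.

```php
<?php
// Sketch only: remove specific code points written in hex.
// Single quotes keep \x{...} intact for PCRE instead of letting PHP interpret it.

// U+006C is the letter "l" from the question.
$clean = preg_replace('/\x{006C}/u', '', 'hello');
echo $clean, "\n"; // heo

// The bullet from the example above is U+2022; a character class can list several code points.
$bullets = preg_replace('/[\x{2022}]/u', '', '•T•e•s•t•');
echo $bullets, "\n"; // Test
```

Note that with /u, preg_replace returns null when the subject is not valid UTF-8, so the result is worth checking before saving it.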

answered Aug 24 '11 at 05:21

Michael Mior
