
Say a user submits a comment. I want to get the array of Unicode code points that make up its value, discard the code points I consider invalid, and then save the comment. How can I do that?

e.g.

The user submits "hello", and I want to obtain an array $codepoints with the following values:

$codepoints[0] = 0068
$codepoints[1] = 0065
$codepoints[2] = 006C
$codepoints[3] = 006C
$codepoints[4] = 006F

And, for some strange reason, I don't want to allow the letter "l", so I want to discard the characters with the code point U+006C. So the saved comment would be "heo". Is this even possible?

Thanks in advance!

alex
  • See http://stackoverflow.com/questions/395832/how-to-get-code-point-number-for-a-given-character-in-a-utf-8-string – JW. Aug 24 '11 at 03:17
  • It's better to use mb_convert_encoding if you have the multibyte extension installed. Code points are what you get after decoding UTF-8 or UTF-16LE/BE. A code point needs at most 21 bits (the maximum is U+10FFFF), but most systems use 32-bit integers to represent one, for speed. A character can consist of one or more code points, depending on the marks on the glyph. – Rahly Aug 24 '11 at 04:42

1 Answer


Here's an example with Unicode literals.

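// Treat strings and regex patterns as UTF-8.
mb_internal_encoding('utf-8');
mb_regex_encoding('utf-8');
// Remove every bullet character (U+2022) from the string.
echo mb_ereg_replace('[•]', '', '•T•e•s•t•');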

This will output the string Test.
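
If you want the explicit array of code points that the question describes, one possible approach is to convert the string to UTF-32BE, where every code point occupies exactly four bytes, and unpack it into integers. The following is a minimal sketch, assuming the mbstring extension is installed and PHP 7.4 or later (for arrow functions). The helper names utf8_to_codepoints, codepoints_to_utf8, and strip_codepoints are hypothetical and are not part of PHP. It also assumes the input is valid UTF-8; invalid input is not handled here.

```php
<?php
// Hypothetical helpers for illustration; only mb_convert_encoding(),
// pack(), and unpack() are PHP built-ins.

// Decode a UTF-8 string into a 0-indexed array of integer code points.
function utf8_to_codepoints(string $text): array
{
    if ($text === '') {
        return [];
    }
    // UTF-32BE stores each code point as one 4-byte big-endian integer.
    $utf32 = mb_convert_encoding($text, 'UTF-32BE', 'UTF-8');
    // unpack() returns a 1-indexed array; array_values() reindexes from 0.
    return array_values(unpack('N*', $utf32));
}

// Encode an array of integer code points back into a UTF-8 string.
function codepoints_to_utf8(array $codepoints): string
{
    if ($codepoints === []) {
        return '';
    }
    $utf32 = pack('N*', ...array_values($codepoints));
    return mb_convert_encoding($utf32, 'UTF-8', 'UTF-32BE');
}

// Drop every code point listed in $banned and return the remaining text.
function strip_codepoints(string $text, array $banned): string
{
    $kept = array_filter(
        utf8_to_codepoints($text),
        fn (int $cp): bool => !in_array($cp, $banned, true)
    );
    return codepoints_to_utf8($kept);
}

// Code points of "hello", formatted as 4-digit hex like in the question.
$hex = array_map(fn (int $cp): string => sprintf('%04X', $cp), utf8_to_codepoints('hello'));
echo implode(' ', $hex), "\n";                  // 0068 0065 006C 006C 006F

// Discard U+006C ("l") before saving the comment.
echo strip_codepoints('hello', [0x006C]), "\n"; // heo
```

Working on integers makes it easy to apply arbitrary rules (ranges, lookup tables) rather than only what a regular expression can express.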

If you'd rather write the code points in hex, this answer may be useful.
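
As a rough illustration of writing the code points in hex (this is a sketch, not necessarily what the linked answer does), PHP's PCRE functions accept \x{...} escapes when the pattern uses the /u modifier. The example assumes PHP 7 or later for the "\u{2022}" string escape, and the variable name $clean is just a placeholder.

```php
<?php
// The /u modifier makes PCRE treat both pattern and subject as UTF-8,
// which enables \x{hhhh} escapes for code points above 0xFF.
// The character class lists the code points to discard:
// U+006C (the letter "l") and U+2022 (the bullet).
$clean = preg_replace('/[\x{006C}\x{2022}]/u', '', "\u{2022}hello\u{2022}");

echo $clean, "\n"; // heo
```

Note that preg_replace returns null on error, for example if the subject is not valid UTF-8 under /u, so it may be worth checking the result before saving it.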

Michael Mior