Remove emojis / unicode chars

Question

My website and database are set to UTF-8 and utf8mb4, respectively.

On textareas it's perfectly fine when users put utf-8 symbols/emojis.

But on certain input fields (name, address, etc.) I want to rule out those "funny symbols" and deal only with basic text and numbers, including the Danish characters æøå, accents, and symbols like -_'@()?=,.:;!"#&<> etc.

How would I go about this?

Is there a native PHP function to strip Unicode symbols/characters, or do I have to find or write a specific regex function for it?

https://stackoverflow.com/questions/2896450/allow-only-a-za-z0-9-in-string-using-php Refer here. — Omkar Nath Singh, Jul 18 '18 at 14:08
You might want to take a look at my question about a very similar issue in Java, which I believe might have a similar solution in PHP. https://stackoverflow.com/questions/49510006/remove-and-other-such-emojis-images-signs-from-java-string — riorio, Jul 18 '18 at 14:46
Provide the hex for a short segment of suspicious text. I may be able to decipher what the encoding is and whether it was mangled from something more legible (cf "Mojibake"). — Rick James, Jul 18 '18 at 17:12
Do you have a more specific definition of non-funny symbol? Because `å` is as Unicode as `韻` or ``... or `a`. Perhaps you want to filter by [plane](https://en.wikipedia.org/wiki/Plane_(Unicode)) but you'd still need to determine which ones are acceptable. — Álvaro González, Jul 18 '18 at 18:04
The Java regexp in the question linked by @OmkarNathSingh can possibly be used in `preg_replace()`. — Álvaro González, Jul 18 '18 at 18:07

Stu Care · Accepted Answer · 2018-07-18T14:50:14.480

4

PHP has functions for checking encoding, such as `mb_check_encoding` (http://php.net/manual/en/function.mb-check-encoding.php), but to strip out characters I think you would need to use a regex:

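function StripNonUTF($str){
  // The u modifier makes PCRE treat pattern and subject as UTF-8.
  // PHP has no g modifier; preg_replace() already replaces every match.
  return preg_replace('/[^\pL\pM[:ascii:]]+/u', '', $str);
}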

\pL matches any kind of letter from any language
\pM matches a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.)
[:ascii:] matches a character with ASCII value 0 through 127
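
As a minimal sketch of how the pattern above could be used on a form field: the helper name `stripSymbols` and the sample input are hypothetical, and the `mbstring` extension is assumed to be available. With the `u` modifier, `preg_replace()` returns `null` when the subject is not valid UTF-8, so the sketch checks the encoding first instead of silently losing the input.

```php
<?php
// Hypothetical helper, not part of the original answer.
// Assumes the mbstring extension is loaded (for mb_check_encoding).
function stripSymbols(string $str): ?string
{
    // Reject malformed UTF-8 explicitly; with the u modifier
    // preg_replace() would otherwise return null for such input.
    if (!mb_check_encoding($str, 'UTF-8')) {
        return null;
    }

    // Remove every run of characters that is not a letter (\pL),
    // a combining mark (\pM), or an ASCII character.
    return preg_replace('/[^\pL\pM[:ascii:]]+/u', '', $str);
}

// The emoji is removed; Danish letters and ASCII punctuation are kept.
echo stripSymbols("Søren Ågård 😀 <a@b.dk>"), PHP_EOL;
```

Note that `\pL` keeps letters from every script, so a name field filtered this way would still accept characters such as `韻`. If only Latin letters are wanted, the character class would have to be narrowed further.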

edited Jul 18 '18 at 14:50

answered Jul 18 '18 at 14:29

6

If the string is ***not*** UTF-8, you're going to strip stuff…? That can leave your string mightily mangled if you don't know what encoding you're working with. – deceze Jul 18 '18 at 14:48
2

`mb_check_encoding — Check if the string is valid for the specified encoding`. The question has nothing to do with having malformed UTF-8 strings. – Álvaro González Jul 18 '18 at 18:06
