
My website and database are set to utf-8 and utf8mb4.

In textareas it's perfectly fine for users to enter utf-8 symbols/emojis.

But in certain input fields (name, address, etc.) I want to rule out those "funny symbols" and only deal with basic text and numbers, including the Danish characters æøå, accented letters, and symbols like -_'@()?=,.:;!"#&<> etc.

How would I go about this?

Is there a native PHP function to strip Unicode symbols/characters, or do I have to find or write a specific regex for it?

mowgli
  • Refer here: https://stackoverflow.com/questions/2896450/allow-only-a-za-z0-9-in-string-using-php – Omkar Nath Singh Jul 18 '18 at 14:08
  • You might want to take a look at my question about a very similar issue in Java; I believe a similar solution might work in PHP. https://stackoverflow.com/questions/49510006/remove-and-other-such-emojis-images-signs-from-java-string – riorio Jul 18 '18 at 14:46
  • Provide the hex for a short segment of suspicious text. I may be able to decipher what the encoding is and whether it was mangled from something more legible (cf "Mojibake"). – Rick James Jul 18 '18 at 17:12
  • Do you have a more specific definition of non-funny symbol? Because `å` is as Unicode as `韻` or an emoji... or `a`. Perhaps you want to filter by [plane](https://en.wikipedia.org/wiki/Plane_(Unicode)) but you'd still need to determine which ones are acceptable. – Álvaro González Jul 18 '18 at 18:04
  • The Java regexp in the question linked by @OmkarNathSingh can possibly be used in `preg_replace()`. – Álvaro González Jul 18 '18 at 18:07
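
To make the point about defining "non-funny" concrete, here is a minimal sketch that assumes the acceptable set is Latin-script letters (which covers æøå and accented letters), combining marks, and plain ASCII. The function name `keepLatinAndAscii` and the sample input are hypothetical, and whether Latin-only is the right cut-off is an assumption that depends on the fields in question.

```php
<?php
// Hypothetical helper: keep Latin-script letters, combining marks and ASCII;
// remove everything else (other scripts, emoji, non-ASCII symbols).
function keepLatinAndAscii(string $str): string
{
    // \p{Latin} matches characters of the Latin script, \pM combining marks.
    // The u modifier makes PCRE treat pattern and subject as UTF-8.
    $result = preg_replace('/[^\p{Latin}\pM[:ascii:]]+/u', '', $str);

    // preg_replace() returns null on error, e.g. malformed UTF-8 input.
    return $result ?? '';
}

// The Han character and the emoji are removed; æøå, é and ASCII stay.
echo keepLatinAndAscii("Åse Ærø, café 韻 😀"), "\n";
```

A stricter variant could replace `[:ascii:]` with an explicit list of allowed punctuation if control characters or some ASCII symbols should be rejected as well.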

1 Answer


There are functions for checking encoding, such as mb_check_encoding (http://php.net/manual/en/function.mb-check-encoding.php), but to strip out characters I think you would need to use a regex:

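function StripNonUTF($str){
  // PHP has no "g" flag: preg_replace() already replaces every match.
  // The "u" modifier makes PCRE treat the pattern and subject as UTF-8.
  return preg_replace('/[^\pL\pM[:ascii:]]+/u', '', $str);
}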
  • \pL matches any kind of letter from any language
  • \pM matches a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.)
  • [:ascii:] matches a character with ASCII value 0 through 127
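
As a quick check of the character classes listed above, the sketch below runs the same pattern against a couple of sample strings. The function name `stripSymbols` and the inputs are illustrative assumptions; the pattern uses the `u` modifier (PHP's PCRE has no `g` flag) so that the subject is read as UTF-8.

```php
<?php
// Hypothetical helper: keep letters (\pL), combining marks (\pM) and ASCII;
// drop everything else, such as emoji and non-ASCII symbols.
function stripSymbols(string $str): string
{
    // The u modifier makes PCRE treat pattern and subject as UTF-8.
    $result = preg_replace('/[^\pL\pM[:ascii:]]+/u', '', $str);

    // preg_replace() returns null on error, e.g. malformed UTF-8 input.
    return $result ?? '';
}

echo stripSymbols("Søren Ærø 😀"), "\n";   // the emoji is removed
echo stripSymbols("José (café) ★"), "\n";  // the star symbol is removed
echo stripSymbols("韻"), "\n";             // kept: \pL matches letters of any script
```

Because `\pL` covers letters from every script, this pattern keeps characters such as `韻`; it only removes things that are neither letters, marks, nor ASCII.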
Stu Care
  • If the string is ***not*** UTF-8, you're going to strip stuff…? That can leave your string mightily mangled if you don't know what encoding you're working with. – deceze Jul 18 '18 at 14:48
  • `mb_check_encoding — Check if the string is valid for the specified encoding`. The question has nothing to do with having malformed UTF-8 strings. – Álvaro González Jul 18 '18 at 18:06
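
For the case where the input encoding is not guaranteed, one possible approach is to check it first and only then strip. This is a sketch under assumptions: the wrapper name `stripSymbolsChecked` is hypothetical, the mbstring extension is assumed to be loaded, and returning `null` for non-UTF-8 input is just one way to signal the problem to the caller.

```php
<?php
// Hypothetical wrapper: only strip when the input is valid UTF-8.
function stripSymbolsChecked(string $str): ?string
{
    // mb_check_encoding() reports whether the bytes are valid for the encoding.
    if (!mb_check_encoding($str, 'UTF-8')) {
        return null; // the caller decides how to handle an unknown encoding
    }

    // Same pattern as in the answer, with the u modifier for UTF-8 input.
    return preg_replace('/[^\pL\pM[:ascii:]]+/u', '', $str);
}

// "\xE6\xF8\xE5" is "æøå" in ISO-8859-1, which is not valid UTF-8.
var_dump(stripSymbolsChecked("\xE6\xF8\xE5"));
var_dump(stripSymbolsChecked("æøå ✔"));
```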