2

Is there an easy way to remove all non-alphanumeric characters from a string in PHP that doesn't require listing them all individually in a regex?

I have been using `preg_replace("/[^a-zA-Z0-9\s\'\-]/", "", $my_string);` in the past, but this filters out important characters like ÀÈÌÒÙß.

I need to sanitize a name field, so monetary and mathematical characters/symbols are not needed.

trejder
Citricguy
  • What makes those characters you listed important but not [`Þ`](http://en.wikipedia.org/wiki/Thorn_%28letter%29)? (for example) Or a whole bunch of others. If you're going to allow Mu, why not Pi, or the rest of the Greek alphabet? And if you allow Yen, why not Pound and Dollar? I guess the question is where do you draw the line - what characters do you want to exclude, and why? what's special about those characters that doesn't apply to `µ`? – Spudley Sep 01 '11 at 14:20
  • Technically speaking, there are a few code points that have the `\p{alphabetic}` property but are neither `\pL` nor `\pN`. Still, most people are content with using `[\pL\pN]` for alphanumerics, especially since PHP doesn’t appear to support the `\p{alphabetic}` property required by [UTS#18 RL1.2](http://unicode.org/reports/tr18/#Compatibility_Properties) on Compatibility Properties. – tchrist Sep 01 '11 at 15:35
  • You're right @Spudley. I modified the question to clarify as it made no sense to allow mathematical or monetary symbols while sanitizing a name field. – Citricguy Sep 01 '11 at 23:11
  • possible duplicate of [Remove non-alphanumeric characters](http://stackoverflow.com/questions/659025/remove-non-alphanumeric-characters) – trejder Jan 15 '15 at 13:05

2 Answers

6

Like this:

preg_replace('/[^\p{L}\p{N}\s]/u', '', $my_string);

As arnaud576875 already mentioned, you should be aware that the pattern is treated as UTF-8 when using the u modifier like I did. Relevant excerpt of the appropriate manual page:

u (PCRE8)

This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5.
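
To make the effect of the `u` modifier concrete, here is a minimal sketch. The sample strings are assumptions chosen for illustration, and the script is assumed to be saved as UTF-8. It also shows one consequence worth knowing: with `u`, a subject that is not valid UTF-8 makes `preg_replace()` return `null` rather than a string.

```php
<?php
// Sample input (an assumption for illustration): accented letters, digits, a currency sign.
$name = "Àlex Groß 42 €5";

// With /u, \p{L} and \p{N} match Unicode letters and numbers,
// so À and ß are kept while the euro sign is stripped.
$clean = preg_replace('/[^\p{L}\p{N}\s]/u', '', $name);
var_dump($clean); // string: "Àlex Groß 42 5"

// With /u the subject must be valid UTF-8. "\xFF" can never appear in UTF-8,
// so the call fails and returns null instead of a cleaned string.
$invalid = "abc\xFF";
$result = preg_replace('/[^\p{L}\p{N}\s]/u', '', $invalid);
var_dump($result === null);                          // bool(true)
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR); // bool(true)
```

If the input might arrive in another encoding, checking for a `null` return before using the result is probably the safest approach.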

Jürgen Thelen
1

Use Unicode categories:

preg_replace("/[^\pL\pN\p{Zs}'-]/u", "", $my_string);
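
As a rough sketch of how this pattern could be wrapped for the name field in the question (`sanitize_name` is a hypothetical helper name, not a PHP built-in, and the sample inputs are made up):

```php
<?php
// sanitize_name() is a hypothetical helper, not part of PHP.
// It keeps letters (\pL), numbers (\pN), space separators (\p{Zs}),
// apostrophes and hyphens, and removes everything else.
// It returns null if $input is not valid UTF-8.
function sanitize_name(string $input): ?string
{
    return preg_replace("/[^\pL\pN\p{Zs}'-]/u", "", $input);
}

// Accented letters, the apostrophe and the hyphen survive;
// the pound sign and exclamation mark are removed.
echo sanitize_name("Zoë O'Neil-Ruß £5!"), "\n"; // Zoë O'Neil-Ruß 5

// Unlike \s, \p{Zs} does not cover tabs or newlines (they are control
// characters), so those are stripped as well.
echo sanitize_name("Ann\tLee"), "\n"; // AnnLee
```

Whether tabs and newlines should be dropped or turned into spaces depends on the form; for a single-line name field, dropping them seems reasonable.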
Toto
  • You need to add the `/u` modifier, too. It tells the regex engine that the target string is UTF-8, which is how it knows the Unicode property escapes will work. – Alan Moore Sep 01 '11 at 14:43
  • @Alan: According to the PCRE docs, you should be able to fix that by using the `PCRE_UCP` option when the pattern is compiled, or by embedding the `(*UCP)` option in the pattern. I find PHP dodgy as hell about what build options of PCRE it uses. For example, [this PHP regex tester page](http://lumadis.be/regex/test_regex.php) will have `\w` matching normal Unicode without doing anything extra. Try matching *“El niño had a fine café.”* with `#\w+#i` and you get each of those words just fine, all without `/u` or `(*UCP)`. PHP seems really screwy because this is all completely unpredictable. – tchrist Sep 01 '11 at 15:49