Remove non-alphanumeric characters (including ß, Ê, etc.) from a string

Question

Is there an easy way to remove all non-alphanumeric characters from a string in PHP that wouldn't require listing them all individually in a regex function?

I have been using `preg_replace("/[^a-zA-Z0-9\s\'\-]/", "", $my_string);` in the past, but this filters out important characters like ÀÈÌÒÙß etc.

I need to sanitize a name field, so monetary and mathematical characters/symbols are not needed.

What makes those characters you listed important but not [`Þ`](http://en.wikipedia.org/wiki/Thorn_%28letter%29), for example, or a whole bunch of others? If you're going to allow Mu, why not Pi, or the rest of the Greek alphabet? And if you allow Yen, why not Pound and Dollar? I guess the question is where you draw the line: what characters do you want to exclude, and why? What's special about those characters that doesn't apply to `µ`? — Spudley, Sep 01 '11 at 14:20
Technically speaking, there are a few code points that have the `\p{alphabetic}` property but are neither `\pL` nor `\pN`. Still, most people are content with using `[\pL\pN]` for alphanumerics, especially since PHP doesn’t appear to support the `\p{alphabetic}` property required by [UTS#18 RL1.2](http://unicode.org/reports/tr18/#Compatibility_Properties) on Compatibility Properties. — tchrist, Sep 01 '11 at 15:35
You're right @Spudley. I modified the question to clarify as it made no sense to allow mathematical or monetary symbols while sanitizing a name field. — Citricguy, Sep 01 '11 at 23:11
possible duplicate of [Remove non-alphanumeric characters](http://stackoverflow.com/questions/659025/remove-non-alphanumeric-characters) — trejder, Jan 15 '15 at 13:05

Jürgen Thelen · Accepted Answer · 2011-09-01T14:33:41.543

6

Like this:

preg_replace('/[^\p{L}\p{N}\s]/u', '', $my_string);

As arnaud576875 already mentioned, you should be aware that the pattern is treated as UTF-8 when using the u modifier like I did. Relevant excerpt of the appropriate manual page:

u (PCRE8)

This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5.
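
A minimal sketch of how the pattern above might be wrapped for a name field, assuming the input is expected to be UTF-8. The function name `strip_non_alnum` is hypothetical and not part of the answer. It relies on `preg_replace()` returning `null` when the subject is not valid UTF-8 under the `u` modifier, so the caller should check for that case.

```php
<?php
// Hypothetical helper (name is illustrative): keeps letters, digits and whitespace.
function strip_non_alnum(string $input): ?string
{
    // With the u modifier, preg_replace() returns null if $input is not valid UTF-8.
    return preg_replace('/[^\p{L}\p{N}\s]/u', '', $input);
}

// The currency sign and the exclamation mark are removed; ü and ß survive.
echo strip_non_alnum("Jürgen Straße €5!"), "\n"; // Jürgen Straße 5

// A lone 0xFF byte is not valid UTF-8, so the call fails and yields null.
var_dump(strip_non_alnum("\xFF") === null); // bool(true)
```

When the result is `null`, `preg_last_error()` can be inspected to tell a UTF-8 problem apart from other PCRE failures.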

edited Sep 01 '11 at 14:33

answered Sep 01 '11 at 14:17

please note that $my_string has to be in utf8 – Arnaud Le Blanc Sep 01 '11 at 14:18
@arnaud576875: I thought the `u` modifier (PCRE8) clearly indicates that the pattern is treated as UTF-8 then, but you are right. Will update my answer with an appropriate ref. – Jürgen Thelen Sep 01 '11 at 14:27
Any way to keep this in effect, but exclude underscores from being removed? – Nathan Dec 20 '16 at 12:21
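
One possible way to handle the underscore case, as a sketch rather than a reply from the answer's author: add `_` to the negated character class so that it counts among the characters to keep. The sample string and variable names are illustrative.

```php
<?php
// Same pattern as the accepted answer, with "_" added to the set of kept characters.
$my_string = "first_name: José-María (2)";
$clean = preg_replace('/[^\p{L}\p{N}\s_]/u', '', $my_string);

// The colon, hyphen and parentheses are removed; the underscore stays.
echo $clean, "\n"; // first_name JoséMaría 2
```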

Toto · Answer 2 · 2011-09-01T14:52:36.320

1

Use Unicode categories:

preg_replace("/[^\pL\pN\p{Zs}'-]/u", "", $my_string);
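
A short sketch of what this pattern does to a sample name, assuming UTF-8 input (the sample string is made up). It keeps apostrophes and hyphens, which are common in names, and because `\p{Zs}` covers only space separators, a tab is stripped here, whereas `\s` would keep it.

```php
<?php
// The pattern above: keep letters, digits, space separators, apostrophes and hyphens.
$my_string = "Seán O'Brien-Müller\t(€20)";
$clean = preg_replace("/[^\pL\pN\p{Zs}'-]/u", "", $my_string);

// The tab, parentheses and euro sign are removed; the apostrophe and hyphen stay.
echo $clean, "\n"; // Seán O'Brien-Müller20
```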

edited Sep 01 '11 at 14:52

answered Sep 01 '11 at 14:18

You need to add the `/u` modifier, too. It tells the regex engine that the target string is UTF-8, which is how it knows the Unicode property escapes will work. – Alan Moore Sep 01 '11 at 14:43
@Alan: According to the PCRE docs, you should be able to fix that by using the `PCRE_UCP` option when the pattern is compiled, or by embedding the `(*UCP)` option in the pattern. I find PHP dodgy as hell about what build options of PCRE it uses. For example, [this PHP regex tester page](http://lumadis.be/regex/test_regex.php) will have `\w` matching normal Unicode without doing anything extra. Try matching *“El niño had a fine café.”* with `#\w+#i` and you get each of those words just fine, all without `/u` or `(*UCP)`. PHP seems really screwy because this is all completely unpredictable. – tchrist Sep 01 '11 at 15:49
