Regex for removing special characters on a multilingual string

Question

The regex most commonly suggested for removing special characters seems to be this:

preg_replace( '/[^a-zA-Z0-9]/', '', $string );

The problem is that it also removes non-English characters.

Is there a regex that removes special characters in all languages? Or is the only solution to match each special character explicitly and remove it?
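To make the problem concrete, here is a minimal sketch of what the ASCII-only pattern does to a mixed-script string. The sample text is an assumption chosen for illustration, not taken from the question.

```php
<?php
// Illustrative input (an assumption): German, Russian and Japanese text.
$string = 'Straße, Москва & 東京 2014!';

// Without the /u modifier the pattern works byte by byte, so every byte
// outside a-z, A-Z and 0-9 is dropped, including all non-ASCII letters.
$asciiOnly = preg_replace('/[^a-zA-Z0-9]/', '', $string);

echo $asciiOnly, "\n"; // Strae2014
```

Only the ASCII letters and digits survive; "ß" and the Cyrillic and Japanese words are lost along with the punctuation.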

Casimir et Hippolyte · Accepted Answer · 2014-04-29T18:36:39.580

5

You can use this instead:

preg_replace('/\P{Xan}+/u', '', $string );

\p{Xan} matches any character that is a letter or a number in any script covered by Unicode. It is a PCRE-specific property, equivalent to [\p{L}\p{N}].
\P{Xan} matches any character that is not a letter or a number. It is shorthand for [^\p{Xan}].
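As a quick check, the pattern can be tried on a string that mixes several scripts. This is only a sketch; the sample text is an assumption chosen for illustration.

```php
<?php
// Illustrative input (an assumption): German, Russian and Japanese text.
$string = 'Straße, Москва & 東京 2014!';

// \P{Xan}+ matches runs of characters that are neither letters nor numbers.
// The /u modifier tells PCRE to treat pattern and subject as UTF-8.
$clean = preg_replace('/\P{Xan}+/u', '', $string);

echo $clean, "\n"; // StraßeМосква東京2014
```

Spaces and punctuation are removed, while letters and digits from every script are kept.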

edited Apr 29 '14 at 18:36

answered Apr 29 '14 at 18:15

Casimir et Hippolyte


Thanks! I understand that `\P` is a character without Unicode property. Can you please explain `{Xan}`? – A.Jesin Apr 29 '14 at 18:19

@A.Jesin: the uppercase P is only used to negate a Unicode character class. For example, `\p{Latin}` is a character class for all Latin letters (like `[a-zA-Z]`, but including accented letters). If you want to negate it to match everything that is not a Latin letter, you write `\P{Latin}`. – Casimir et Hippolyte Apr 29 '14 at 18:28
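The difference between the lowercase and uppercase forms can be sketched like this. The sample string and variable names are assumptions for illustration.

```php
<?php
// Illustrative input (an assumption): one Latin word, one Cyrillic word.
$string = 'Straße Москва';

// \P{Latin} matches anything that is NOT in the Latin script,
// so removing it leaves only the Latin letters.
$latinOnly = preg_replace('/\P{Latin}+/u', '', $string);

// \p{Latin} matches Latin-script letters, so removing it leaves the rest
// (here: the space and the Cyrillic word).
$withoutLatin = preg_replace('/\p{Latin}+/u', '', $string);

echo $latinOnly, "\n";    // Straße
echo $withoutLatin, "\n"; // " Москва" (with the leading space)
```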

@A.Jesin: You can find all Unicode character classes in this document: http://pcre.org/pcre.txt – Casimir et Hippolyte Apr 29 '14 at 18:30

score 3 · Answer 2 · answered Apr 29 '14 at 18:17


You can use:

$string = preg_replace( '/[^\p{L}\p{N}]+/u', '', $string );
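For reuse, the same pattern could be wrapped in a small helper. This is a sketch under assumptions, not part of the answer: the function name `stripSpecialChars` and the `$keepMarks` flag are hypothetical, and it assumes PHP 7 or later. The optional `\p{M}` is an addition for text whose accents or vowel signs are stored as separate combining marks, which `\p{L}` and `\p{N}` alone do not cover.

```php
<?php
/**
 * Hypothetical helper: remove everything except letters and numbers.
 * With $keepMarks = true, combining marks (\p{M}) are kept as well.
 */
function stripSpecialChars(string $string, bool $keepMarks = false): string
{
    // \p{L} = letters, \p{N} = numbers, \p{M} = combining marks.
    $pattern = $keepMarks ? '/[^\p{L}\p{N}\p{M}]+/u' : '/[^\p{L}\p{N}]+/u';

    $result = preg_replace($pattern, '', $string);
    if ($result === null) {
        // preg_replace() returns null on failure, which with /u can
        // happen when the subject is not valid UTF-8.
        throw new RuntimeException('preg_replace failed, error code ' . preg_last_error());
    }
    return $result;
}

echo stripSpecialChars('Straße, Москва & 東京 2014!'), "\n"; // StraßeМосква東京2014
```

With a decomposed accent such as `"e\u{0301}"`, the default call returns just `e`, while passing `true` as the second argument keeps the combining mark.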


anubhava

