How can I replace all non-word characters in a UTF-8 string?
For ASCII, I use:
$url = preg_replace("/\W+/", " ", $url);
Is there an equivalent for UTF-8?
Use Unicode properties:
$url = preg_replace("/[^\p{L}\p{N}_]+/u", " ", $url);
\p{L} stands for any letter and \p{N} stands for any number.
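As a minimal sketch, the pattern above can be wrapped in a small helper. The function name replace_non_word and the sample string are made up for illustration and are not part of the original answer. The null check is there because preg_replace returns null on failure, which with the u modifier includes input that is not valid UTF-8.

```php
<?php
// Hypothetical helper (name is an assumption, not from the answer above).
function replace_non_word(string $text, string $replacement = ' '): string
{
    // The u modifier makes PCRE treat both pattern and subject as UTF-8,
    // so \p{L} and \p{N} match letters and numbers from any script.
    $result = preg_replace('/[^\p{L}\p{N}_]+/u', $replacement, $text);

    if ($result === null) {
        // preg_replace returns null on error, e.g. malformed UTF-8 input.
        throw new InvalidArgumentException(
            'preg_replace failed, error code ' . preg_last_error()
        );
    }

    return $result;
}

echo replace_non_word('héllo, wörld! 123_ok'), "\n"; // prints "héllo wörld 123_ok"
```

Each run of non-word characters collapses into a single replacement because of the + quantifier; drop it if every character should be replaced individually.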
You can use the Xwd character class that contains letters, digits and underscore:
$url = preg_replace('~\P{Xwd}+~u', ' ', $url);
If you don't want the underscore, you can use Xan instead.
\p{Xwd} (Perl word character) is a predefined character class and \P{Xwd} is the negation of this class.
The u modifier means that both the pattern and the subject string are treated as UTF-8.
Equivalences:
\p{Xan} <=> [\p{L}\p{N}]
\p{Xwd} <=> [\p{Xan}_]
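The difference between the two classes can be checked with a short sketch. It assumes a PHP build whose PCRE library supports the Xan and Xwd properties; the sample string and variable names are made up for illustration. The comparison with the explicit character classes is only demonstrated for this sample input.

```php
<?php
// Sample input (an assumption chosen for illustration).
$input = 'user_name: José 42';

// Xwd keeps letters, digits and the underscore.
$withUnderscore = preg_replace('~\P{Xwd}+~u', ' ', $input);

// Xan keeps only letters and digits, so the underscore is replaced too.
$alnumOnly = preg_replace('~\P{Xan}+~u', ' ', $input);

// The same replacements written with explicit Unicode properties.
$explicitWd = preg_replace('~[^\p{L}\p{N}_]+~u', ' ', $input);
$explicitAn = preg_replace('~[^\p{L}\p{N}]+~u', ' ', $input);

echo $withUnderscore, "\n"; // prints "user_name José 42"
echo $alnumOnly, "\n";      // prints "user name José 42"

// For this input, both spellings give the same result.
var_dump($withUnderscore === $explicitWd, $alnumOnly === $explicitAn);
```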