
How can I replace all non-word characters (UTF-8) in a string?

For ASCII:

$url = preg_replace("/\W+/", " ", $url);

Is there an equivalent for UTF-8?

UFO

2 Answers

2

Use Unicode properties:

$url = preg_replace("/[^\p{L}\p{N}_]+/u", " ", $url);

\p{L} matches any Unicode letter.
\p{N} matches any Unicode numeric character.
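
As a minimal sketch of the pattern above in use: the helper name `replace_non_word` and the sample string are hypothetical, and the example assumes the script file and its input are valid UTF-8.

```php
<?php
// Hypothetical helper: replaces each run of characters that are not
// Unicode letters, numbers, or underscores with $replacement.
function replace_non_word(string $s, string $replacement = ' '): string
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8,
    // so multibyte characters are matched as whole characters.
    $result = preg_replace('/[^\p{L}\p{N}_]+/u', $replacement, $s);

    // preg_replace returns null on error, for example on invalid UTF-8 input.
    if ($result === null) {
        throw new RuntimeException('preg_replace failed, code ' . preg_last_error());
    }
    return $result;
}

// Sample input is an assumption for illustration only.
echo trim(replace_non_word("Crème brûlée & café: 100% été!")), "\n";
// prints "Crème brûlée café 100 été"
```

Checking for a `null` return is worthwhile here because input that is not valid UTF-8 makes the match fail when `/u` is set.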

Toto
  • It returns this: �-�-�-�-�-�-�-�-�-�-�-�-�-�-�-�-�- for $url = preg_replace("/[^\p{L}\p{N}_]+/", "-", $url); – UFO Feb 26 '14 at 18:04
  • It must be "/[^\p{L}\p{N}_]+/u", but I am not sure what the difference between these two answers is. – UFO Feb 26 '14 at 18:21
  • @UFO: Yes, I forgot the `/u` modifier that deals with unicode. – Toto Feb 26 '14 at 18:27
2

You can use the Xwd character class that contains letters, digits and underscore:

$url = preg_replace('~\P{Xwd}+~u', ' ', $url);

If you don't want the underscore, you can use Xan instead.

\p{Xwd} (Perl word character) is a predefined character class and \P{Xwd} is the negation of this class.

The u modifier means that the pattern and the subject string are treated as UTF-8 strings.

Equivalence:

\p{Xan}        <=>     [\p{L}\p{N}]
\p{Xwd}        <=>     [\p{Xan}_]
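
The equivalences above can be checked with a short sketch. The function names and the sample string are hypothetical, and the example assumes a PCRE build recent enough to support the `Xwd` and `Xan` properties.

```php
<?php
// Hypothetical helpers comparing the short classes with their long forms.
function strip_non_xwd(string $s): string
{
    // Replaces everything that is not a letter, number, or underscore.
    return preg_replace('~\P{Xwd}+~u', ' ', $s);
}

function strip_non_xan(string $s): string
{
    // Replaces everything that is not a letter or number (underscore included).
    return preg_replace('~\P{Xan}+~u', ' ', $s);
}

// Sample input is an assumption for illustration only.
$s = "naïve_user@example.com";

echo strip_non_xwd($s), "\n"; // prints "naïve_user example com"
echo strip_non_xan($s), "\n"; // prints "naïve user example com"

// The long forms give the same results on this input.
var_dump(strip_non_xwd($s) === preg_replace('~[^\p{L}\p{N}_]+~u', ' ', $s)); // bool(true)
var_dump(strip_non_xan($s) === preg_replace('~[^\p{L}\p{N}]+~u', ' ', $s));  // bool(true)
```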
Casimir et Hippolyte