14

How can I remove characters such as punctuation, commas, and dashes from a string in a multibyte-safe manner?

I will be working with input in many different languages, and I am wondering whether there is something that can help me with this.

Thanks

Thomas

4 Answers

26

You can use Unicode character properties in the regex:

To match any run of non-letter characters, use \PL+, the negation of \p{L}. To keep whitespace, use a character class such as [^\pL\s]+. Or, to remove only punctuation, use \pP+.

And don't forget the regex /u modifier, so that the pattern and the subject are treated as UTF-8.
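A minimal sketch of those patterns with preg_replace, assuming PHP and UTF-8 input. The function names are hypothetical and only there for illustration. Note that \pP covers punctuation only; symbols such as $ or + belong to \pS and are left alone by the first function.

```php
<?php
// Hypothetical helpers for illustration; the names are not from the answer above.

// Remove punctuation only (\pP), keeping letters, digits, and whitespace.
// preg_replace() returns null if the subject is not valid UTF-8 under /u.
function strip_punctuation(string $text): ?string
{
    return preg_replace('/\pP+/u', '', $text);
}

// Remove everything that is neither a letter (\pL) nor whitespace (\s).
function keep_letters_and_spaces(string $text): ?string
{
    return preg_replace('/[^\pL\s]+/u', '', $text);
}

echo strip_punctuation("Héllo, wörld!"), "\n";
echo keep_letters_and_spaces("abc123 日本語!"), "\n";
```

The first call should leave the accented letters intact and drop only the comma and exclamation mark; the second also drops the digits, since they are not letters.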

mario
2

I used this:

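// Replace each run of non-letter, non-digit characters with a space ("|" in a class is a literal pipe, so it is dropped).
$clean = preg_replace( "/[^\p{L}\p{N}]+/u", " ", $raw );
// Collapse runs of two or more separator characters into a single space.
$clean = preg_replace( "/\p{Z}{2,}/u", " ", $clean );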
0

Similar post

Remove non-utf8 characters from string

I'm not sure if this covers all characters though.

According to this post on the dreamincode forum

http://www.dreamincode.net/forums/topic/78179-regular-expression-to-remove-non-ascii-characters/

this should work:

/[^\x{21}-\x{7E}\s\t\n\r]/
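As a quick check of what that pattern covers, here is a small sketch assuming PHP's preg_replace and a UTF-8 encoded input string (the variable names are illustrative). Because the character class only allows printable ASCII and whitespace, accented and other non-ASCII letters are removed along with everything else outside that range, which may not be what you want for multilingual text.

```php
<?php
// The pattern quoted above: keep printable ASCII (0x21-0x7E) and whitespace only.
$pattern = '/[^\x{21}-\x{7E}\s\t\n\r]/';

// "é" is encoded as bytes outside the allowed range, so it is stripped.
$result = preg_replace($pattern, '', "Café!");

echo $result, "\n";
```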
CBusBus
-2

Maybe this will be useful?

$newstring = preg_replace('/[^0-9a-zA-Z\s]/', '', $oldstring);
Scuba Kay