
I store all my data in MySQL as UTF-8. Before the data is inserted into the database, I need to strip unwanted characters from the strings, which are UTF-8 encoded. I know how to use regular expressions and string replacement, but I do not know how to work with Arabic characters.

Sample string that needs to be cleaned : "████ .. الــقــوانين الجديـــدة في قســـم الـعنايـ";

Thanking you

Imran Omar Bukhsh
  • Are there any wanted Arabic characters? Unicode allocates a few ranges to Arabic; can you leverage off those ranges? – Jonathan Leffler Jul 10 '11 at 16:38
  • @Jonathan Leffler - I meant 'unwanted characters'. Fixed. Yes, but how do I get the Unicode code point of a letter in a string before I check whether it lies in a range? – Imran Omar Bukhsh Jul 10 '11 at 16:45
  • @Imran: You don't need it. Just specify the range you don't want. The regex could then be (in words): *replace all characters in this range by an empty string*. The regex engine figures out whether the character is in the range or not. – Felix Kling Jul 10 '11 at 16:49
  • @Felix Kling - so how do I specify a range of Unicode characters? An example would be nice; I can easily find the range. – Imran Omar Bukhsh Jul 10 '11 at 16:50
  • @Imran: `[\x{FFFF}-\x{FFFF}]` should do it and you have to set the `u` modifier to turn on unicode. See here for more information: http://www.regular-expressions.info/php.html – Felix Kling Jul 10 '11 at 16:52
  • @Felix Kling - echo preg_replace('[\x{0600}-\x{06FF}]','',$string); ? – Imran Omar Bukhsh Jul 10 '11 at 16:57
  • @Felix Kling - Compilation failed: character value in \x{...} sequence is too large at offset 7 – Imran Omar Bukhsh Jul 10 '11 at 16:58
  • @Imran: Well, you have to pass a proper formated expression. Try: `preg_replace('/[\x{0600}-\x{06FF}]/u','',$string)`. – Felix Kling Jul 10 '11 at 17:08
  • @Felix Kling - wow, good, it removed the Arabic. How do I use the 'not' expression so I can have it the other way round? – Imran Omar Bukhsh Jul 10 '11 at 17:26
  • @Felix Kling - ok preg_replace('/[^\x{0600}-\x{06FF}]/u','',$string); works. You can put your answer below so I can mark it as the right answer. – Imran Omar Bukhsh Jul 10 '11 at 17:40
  • @Felix: take your comments and make them into an answer. Note that there is an 'Arabic Supplemental' range U+0750..U+077F, an Arabic Presentation Forms A in the range U+FB50..U+FC3F, and Arabic Presentation Forms B in the range U+FE70..U+FEFC. You can find the information at [Unicode Charts](http://www.unicode.org/charts/). I had the files downloaded from playing with this a year or two ago. – Jonathan Leffler Jul 10 '11 at 17:41
  • @Jonathan: Actually I thought *you* should create an answer because you started with unicode ranges ;) – Felix Kling Jul 10 '11 at 17:57
  • @Jonathan Leffler , @Felix Kling - i thought i should be the one to decide – Imran Omar Bukhsh Jul 10 '11 at 18:25
  • @Imran: Note that I added a `+` after the character class so that it matches *one or more characters*. This is better than replacing every single character separately. – Felix Kling Jul 10 '11 at 18:46

1 Answer


Ok. As @Jonathan Leffler already said, if you can specify the Unicode character ranges for the characters that need to be replaced, you can use a regular expression to replace those characters with an empty string.

In a PHP pattern, a Unicode character is written as \x{FFFF}, where FFFF stands for the character's hexadecimal code point. In addition, you have to set the u modifier to make PHP treat the pattern as UTF-8.
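To illustrate why the u modifier matters, here is a minimal sketch that runs the same pattern with and without it. The variable names are only illustrative, and the behaviour without the modifier is what I would expect from PCRE (it mirrors the "character value in \x{...} sequence is too large" error from the comments), so treat it as an assumption to verify on your PHP version.

```php
<?php
$subject = 'abc';

// Without the u modifier, PCRE rejects code points above 0xFF in \x{...},
// so the pattern does not compile and preg_replace() returns null.
// The @ only silences the warning for this demonstration.
$withoutModifier = @preg_replace('/[\x{0600}-\x{06FF}]/', '', $subject);
var_dump($withoutModifier === null);

// With the u modifier the same pattern compiles; nothing matches here,
// so the subject comes back unchanged.
$withModifier = preg_replace('/[\x{0600}-\x{06FF}]/u', '', $subject);
var_dump($withModifier === 'abc');
```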

So in the end, you have something like this:

preg_replace('/[\x{FFFF}-\x{FFFF}]+/u','',$string);

where

  • /.../u are the delimiters plus the modifier
  • [...]+ is a character class plus a quantifier, which means: match any of the characters inside, one or more times
  • \x{FFFF}-\x{FFFF} is a Unicode character range (FFFF is only a placeholder here; you have to provide the actual code points of the first and last character).
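As a concrete sketch, here is the placeholder range filled in with the basic Arabic block (U+0600 to U+06FF) that came up in the comments. The sample text and variable names are only illustrative, and the input is assumed to be valid UTF-8.

```php
<?php
// Illustrative input: Latin text mixed with an Arabic word.
$string = 'abc قسم def';

// Remove every run of characters in the basic Arabic block U+0600..U+06FF.
// The u modifier makes PCRE treat pattern and subject as UTF-8.
$withoutArabic = preg_replace('/[\x{0600}-\x{06FF}]+/u', '', $string);

// preg_replace() returns null on failure, e.g. when the subject is not valid UTF-8.
if ($withoutArabic === null) {
    throw new RuntimeException('preg_replace failed, error code ' . preg_last_error());
}

// The Arabic word is gone; the two spaces that surrounded it remain.
echo $withoutArabic, PHP_EOL;
```

Checking for null is worth the extra lines when the strings come from user input, since an invalid byte sequence makes the whole call fail rather than just skipping the bad bytes.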

You can also negate the character class with a ^, so that you specify the range you want to keep instead:

preg_replace('/[^\x{FFFF}-\x{FFFF}]+/u','',$string);
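For the sample string in the question, a negated class could be sketched roughly as follows. The helper name `keepArabic` is hypothetical, and so are the choices it makes: it keeps whitespace, treats the tatweel (U+0640, the character that stretches words) as unwanted, and uses block boundaries based on the ranges mentioned in the comments, with Presentation Forms A extended to U+FDFF, which is where I believe that block ends. Check the ranges against the Unicode charts before relying on them.

```php
<?php
// Hypothetical helper: keep Arabic characters and spaces, drop everything else.
function keepArabic(string $text): string
{
    // Negated class: match runs of anything that is neither whitespace nor
    // inside one of the assumed Arabic ranges, and delete them.
    $pattern = '/[^\s\x{0600}-\x{06FF}\x{0750}-\x{077F}'
             . '\x{FB50}-\x{FDFF}\x{FE70}-\x{FEFC}]+/u';
    $clean = preg_replace($pattern, '', $text);
    if ($clean === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    // Assumption: the tatweel (U+0640) used to stretch words is unwanted too.
    $clean = preg_replace('/\x{0640}+/u', '', $clean) ?? '';

    // Collapse the whitespace left behind and trim both ends.
    return trim(preg_replace('/\s+/u', ' ', $clean) ?? '');
}

// Part of the sample string from the question, plus some digits.
echo keepArabic('.. الــقــوانين الجديـــدة 123'), PHP_EOL;
```

The tatweel is removed in a second pass because it lies inside the basic Arabic block and would otherwise survive the negated class.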


Felix Kling