
I want to strip every character matching [^a-zа-з0-9_] (replace it with an empty string), but I can't get it to work when the input is a multibyte string.

I tried the mb_* functions (including mb_eregi_replace), iconv, and PCRE with the u modifier, but none of them worked.

mb_eregi_replace returns a correct UTF-8 string but doesn't replace any characters, whereas preg_replace with the same regex does replace them.

Here is my code. It handles the Unicode input, but it doesn't replace anything:

function _data($data)
{
  mb_regex_encoding('UTF-8');
  return mb_eregi_replace('/[^a-zа-з0-9_]+/', '', $data);
}

var_dump(namespace\_data('Текст Removethis- and this _#$)( and also this $*@&$'));

The result still contains the special characters (#, $, ...) that should have been removed. If I change the function to use preg_replace (without Unicode support), they are replaced.
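One likely reason the mb_eregi_replace call above leaves the string untouched: unlike preg_replace, the mb_ereg* functions take the pattern without delimiters, so the surrounding slashes are treated as literal characters to match. The following is a minimal sketch of that difference, reusing the pattern and input from the question; the variable names are illustrative only.

```php
<?php
// Sketch only: illustrates the delimiter difference, not a complete fix.
mb_regex_encoding('UTF-8');

$input = 'Текст Removethis- and this _#$)( and also this $*@&$';

// With PCRE-style delimiters, the slashes are part of the pattern,
// so a match would need a literal "/" on both sides. Nothing is replaced.
$withDelimiters = mb_eregi_replace('/[^a-zа-з0-9_]+/', '', $input);

// Without delimiters, the character class is applied directly.
$withoutDelimiters = mb_eregi_replace('[^a-zа-з0-9_]+', '', $input);

var_dump($withDelimiters === $input);                 // bool(true)
var_dump(strpos($withoutDelimiters, '#') === false);  // "#" no longer present
```

Note that, if the second range is Cyrillic, а-з spans only the letters а through з, so most Cyrillic letters would still be stripped; а-я may have been what was intended.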

Alex Emilov
  • `a-з` looks a bit weird. Is that a Cyrillic `a` and not a regular ASCII `a`? If it's ASCII, you've got one heckuva range of characters specified there. – Marc B Oct 12 '11 at 17:10

1 Answer


As long as your input string is UTF-8 encoded (check this, and re-encode it to UTF-8 if it is not), you can safely use preg_replace with the u (PCRE_UTF8) modifier, which is the lower-case u at the end of the pattern:

function _data($data)
{
  // \w already includes the underscore, so [^\w]+ (or \W+) is equivalent here
  return preg_replace('/[^\w_]+/u', '', $data);
}

var_dump(namespace\_data('Текст Removethis- and this _#$)( and also this $*@&$'));


  • \w = any word character (letters, digits and the underscore)
  • u (at the end) = treat the pattern and the subject string as UTF-8.
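With the u modifier, preg_replace returns NULL when the subject is not valid UTF-8, so it can help to validate the input and check the return value explicitly. Below is a minimal sketch under that assumption; the function name strip_non_word is hypothetical, and the class [^\p{L}\p{N}_] is used as an explicit spelling of "letters, digits and underscore" rather than the \w shorthand from the answer.

```php
<?php
// Hypothetical helper, not part of the original answer.
function strip_non_word(string $data): string
{
    // In u mode, invalid UTF-8 makes preg_replace() return NULL,
    // so reject such input up front (requires the mbstring extension).
    if (!mb_check_encoding($data, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    // \p{L} = any Unicode letter, \p{N} = any Unicode number.
    $result = preg_replace('/[^\p{L}\p{N}_]+/u', '', $data);

    if ($result === null) {
        // preg_last_error() returns one of the PREG_*_ERROR constants.
        throw new RuntimeException('preg_replace failed, error code ' . preg_last_error());
    }

    return $result;
}

var_dump(strip_non_word('Текст Removethis- and this _#$)( and also this $*@&$'));
// string "ТекстRemovethisandthis_andalsothis"
```

If the input arrives in another encoding, it would need to be converted to UTF-8 (for example with mb_convert_encoding and a known source encoding) before calling the helper.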
hakre
  • Well, this doesn't work for me. If а-з (a Cyrillic character) is in $data, the return value is NULL. – Alex Emilov Oct 12 '11 at 18:22
  • In my browser those are three characters (code points). Do you mean a character range or a specific character? Can you give the Unicode number of the character(s) you're having problems with? – hakre Oct 12 '11 at 18:26
  • Hm, without the u modifier it works, strange. /[\W]+/ is perfect – Alex Emilov Oct 12 '11 at 18:28
  • Ensure the string you pass is UTF-8 encoded when you use the u modifier. If it works without the modifier, that's a sign that the PCRE library on the system running the code is locale-aware and knows about your language. However, that code is not portable. With another locale (e.g. on a differently configured system), it may fail. So it is better to look for a UTF-8 solution, as this is more stable and works everywhere UTF-8 is available, which is really common today. – hakre Oct 12 '11 at 18:30
  • Well, var_dump(mb_detect_encoding($data)); returns UTF-8 – Alex Emilov Oct 12 '11 at 18:37
  • Technically it's impossible to "detect" encodings. You have to know the encoding; it is meta-information that accompanies the bytes. If you can give the Unicode code points of the character(s) in question, I would be able to try to reproduce your problem. – hakre Oct 12 '11 at 18:41
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/4206/discussion-between-alex-emilov-and-hakre) – Alex Emilov Oct 12 '11 at 18:47