12

I want to make sure some string replacements I'm running are multibyte-safe. I've found a few mb_str_replace functions around the net, but they're slow. I'm talking about a 20% increase in run time after passing maybe 500-900 bytes through them.

Any recommendations? I'm thinking about using preg_replace as it's native and compiled in so it might be faster. Any thoughts would be appreciated.

Mike B
onassar
  • You need to give more info. What's the replacement string and the encoding of the subject? If the subject is UTF-8 and the replacement string is in the ASCII range, you can use `str_replace`. – Artefacto Aug 15 '10 at 23:46
  • Unicode has been around for, what, 15 years now? Still mucking with mb strings in a core inner loop? Work from the inside out. – Hans Passant Aug 16 '10 at 00:18
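
For the `preg_replace` idea raised in the question, a minimal sketch of a literal, UTF-8-aware replacement might look like the following. The function name `preg_str_replace` is hypothetical, and the sketch assumes the needle and subject are valid UTF-8. It uses `preg_quote` so the needle is matched literally and the `u` modifier so pattern and subject are treated as UTF-8. Whether it is actually faster than a userland `mb_str_replace` would need to be measured.

```php
<?php
// Hypothetical helper: literal search/replace through PCRE in UTF-8 mode.
function preg_str_replace(string $search, string $replace, string $subject): string
{
    if ($search === '') {
        return $subject; // an empty pattern would match at every position
    }
    // preg_quote() escapes regex metacharacters (and the "/" delimiter).
    $pattern = '/' . preg_quote($search, '/') . '/u';
    // A callback returns the replacement verbatim, so "$1" or "\1" in it
    // are not interpreted as backreferences.
    $result = preg_replace_callback(
        $pattern,
        function () use ($replace) {
            return $replace;
        },
        $subject
    );
    if ($result === null) {
        // With the "u" modifier, invalid UTF-8 input makes PCRE fail.
        throw new InvalidArgumentException('PCRE error code ' . preg_last_error());
    }
    return $result;
}
```

The callback form is a design choice to avoid escaping `$` and `\` in the replacement string by hand.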

4 Answers

19

As said there, str_replace is safe to use in UTF-8 contexts, as long as all parameters are valid UTF-8, because there cannot be an ambiguous match between two multibyte-encoded strings. If you check the validity of your input, you have no need to look for a different function.
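
A minimal sketch of that approach, assuming the mbstring extension is available: validate every argument with `mb_check_encoding`, then call plain `str_replace`. The wrapper name `utf8_str_replace` is hypothetical and only handles string arguments, not arrays.

```php
<?php
// Hypothetical wrapper: refuse invalid UTF-8, then use the native str_replace().
function utf8_str_replace(string $search, string $replace, string $subject): string
{
    foreach ([$search, $replace, $subject] as $value) {
        if (!mb_check_encoding($value, 'UTF-8')) {
            throw new InvalidArgumentException('Argument is not valid UTF-8');
        }
    }
    // With valid UTF-8 on all sides, a byte-level match always falls on
    // character boundaries, so the byte-oriented function is safe here.
    return str_replace($search, $replace, $subject);
}
```

For example, `utf8_str_replace("\u{e9}", 'e', "caf\u{e9}")` replaces the two-byte `é` as a whole, while a truncated sequence such as `"\xC3"` is rejected before any replacement happens.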

Áxel Costas Pena
  • This is wrong if you are working with Unicode and care about [Unicode equivalence](http://en.wikipedia.org/wiki/Unicode_equivalence). In Unicode, several different byte sequences can represent the same character. Using `str_replace` would work **only** if you normalize both your strings first. – Qtax Jan 22 '14 at 10:52
  • Good tip. Anyway, my understanding of "multibyte safe" is "they won't give any false positives while matching", which in practice means they won't corrupt the output relative to the intended replacement. – Áxel Costas Pena Jan 24 '14 at 21:12
  • check the provided link – Peyman Mohamadpour May 15 '17 at 06:27
  • Worth noting that UTF-8 is a proper superset of ASCII, and more importantly, multibyte UTF-8 characters will never contain ASCII octets. Therefore, if your `$search` and `$replace` only contain ASCII, you can safely use `str_replace()` on a UTF-8 subject. – jchook Dec 09 '19 at 01:38
  • Re: Qtax unicode normalization, I found the [Unicode Normalization Forms](https://unicode.org/reports/tr15/) spec easier to grok than the wikipedia page. – jchook Dec 09 '19 at 01:46
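
The normalization point from the comments above can be sketched as follows. This assumes the intl extension is installed (it provides the `Normalizer` class); the helper name `normalized_str_replace` is hypothetical. The sketch normalizes every argument to NFC before calling `str_replace`, so a precomposed `é` (U+00E9) and the sequence `e` + combining acute accent (U+0301) compare equal.

```php
<?php
// Hypothetical helper: bring all strings to the same normalization form (NFC)
// before doing a byte-level replacement. Requires the intl extension.
function normalized_str_replace(string $search, string $replace, string $subject): string
{
    $normalized = [];
    foreach ([$search, $replace, $subject] as $value) {
        $nfc = Normalizer::normalize($value, Normalizer::FORM_C);
        if ($nfc === false) {
            throw new InvalidArgumentException('Argument could not be normalized');
        }
        $normalized[] = $nfc;
    }
    return str_replace($normalized[0], $normalized[1], $normalized[2]);
}

$composed   = "\u{00E9}";   // "é" as a single code point
$decomposed = "e\u{0301}";  // "e" followed by a combining acute accent

// Plain str_replace() sees different bytes and finds no match.
$plain = str_replace($composed, 'E', 'caf' . $decomposed);

// After normalization both forms are the same byte sequence.
$fixed = normalized_str_replace($composed, 'E', 'caf' . $decomposed);
```

Note that the result comes back in NFC, which may differ byte-for-byte from the original subject even where nothing was replaced.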
3

As encoding is a real challenge when inputs come from everywhere (UTF-8 or others), I prefer using only multibyte-safe functions. For str_replace, I use this one, which is fast enough.

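if (!function_exists('mb_str_replace'))
{
   // Multibyte-aware counterpart of str_replace(), built on mb_split().
   // mb_split() works in the encoding set by mb_regex_encoding().
   function mb_str_replace($search, $replace, $subject, &$count = 0)
   {
      if (!is_array($subject))
      {
         // Normalize needles and replacements to lists of equal length;
         // missing replacements default to the empty string.
         $searches = is_array($search) ? array_values($search) : array($search);
         $replacements = is_array($replace) ? array_values($replace) : array($replace);
         $replacements = array_pad($replacements, count($searches), '');
         foreach ($searches as $key => $search)
         {
            // mb_split() takes a regular expression, so escape the needle first.
            $parts = mb_split(preg_quote($search), $subject);
            $count += count($parts) - 1;
            $subject = implode($replacements[$key], $parts);
         }
      }
      else
      {
         // Array subject: apply the replacement to every entry.
         foreach ($subject as $key => $value)
         {
            $subject[$key] = mb_str_replace($search, $replace, $value, $count);
         }
      }
      return $subject;
   }
}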
Alain Tiemblo
2

Here's my implementation, based off Alain's answer:

/**
 * Replace all occurrences of the search string with the replacement string. Multibyte safe.
 *
 * @param string|array $search The value being searched for, otherwise known as the needle. An array may be used to designate multiple needles.
 * @param string|array $replace The replacement value that replaces found search values. An array may be used to designate multiple replacements.
 * @param string|array $subject The string or array being searched and replaced on, otherwise known as the haystack.
 *                              If subject is an array, then the search and replace is performed with every entry of subject, and the return value is an array as well.
 * @param string $encoding The encoding parameter is the character encoding. If it is omitted, the internal character encoding value will be used.
 * @param int $count If passed, this will be set to the number of replacements performed.
 * @return array|string
 */
public static function mbReplace($search, $replace, $subject, $encoding = 'auto', &$count=0) {
    if(!is_array($subject)) {
        $searches = is_array($search) ? array_values($search) : [$search];
        $replacements = is_array($replace) ? array_values($replace) : [$replace];
        $replacements = array_pad($replacements, count($searches), '');
        foreach($searches as $key => $search) {
            $replace = $replacements[$key];
            $search_len = mb_strlen($search, $encoding);

            $sb = [];
            while(($offset = mb_strpos($subject, $search, 0, $encoding)) !== false) {
                $sb[] = mb_substr($subject, 0, $offset, $encoding);
                $subject = mb_substr($subject, $offset + $search_len, null, $encoding);
                ++$count;
            }
            $sb[] = $subject;
            $subject = implode($replace, $sb);
        }
    } else {
        foreach($subject as $key => $value) {
            $subject[$key] = self::mbReplace($search, $replace, $value, $encoding, $count);
        }
    }
    return $subject;
}

Alain's version doesn't accept a character encoding, although I suppose you could set it via mb_regex_encoding.
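
As a rough sketch of that `mb_regex_encoding` idea (not taken from either answer): set the regex encoding just for the duration of the `mb_split` call and restore it afterwards. The function name `mb_split_replace` and its `$encoding` parameter are hypothetical, and the sketch assumes mbstring was built with regex support.

```php
<?php
// Hypothetical wrapper: run an mb_split()-based replacement under a chosen
// encoding, then restore whatever regex encoding was active before.
function mb_split_replace(string $search, string $replace, string $subject, string $encoding = 'UTF-8'): string
{
    if ($search === '') {
        return $subject; // nothing to search for
    }
    $previous = mb_regex_encoding(); // current setting, e.g. "UTF-8"
    mb_regex_encoding($encoding);
    try {
        // mb_split() treats the pattern as a regex, so escape the needle.
        $parts = mb_split(preg_quote($search), $subject);
    } finally {
        mb_regex_encoding($previous);
    }
    if ($parts === false) {
        return $subject; // split failed; leave the subject untouched
    }
    return implode($replace, $parts);
}
```

Restoring the previous value in `finally` keeps the global setting from leaking into unrelated `mb_ereg*` calls elsewhere in the application.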

My unit tests pass:

function testMbReplace() {
    $this->assertSame('bbb',Str::mbReplace('a','b','aaa','auto',$count1));
    $this->assertSame(3,$count1);
    $this->assertSame('ccc',Str::mbReplace(['a','b'],['b','c'],'aaa','auto',$count2));
    $this->assertSame(6,$count2);
    $this->assertSame("\xbf\x5c\x27",Str::mbReplace("\x27","\x5c\x27","\xbf\x27",'iso-8859-1'));
    $this->assertSame("\xbf\x27",Str::mbReplace("\x27","\x5c\x27","\xbf\x27",'gbk'));
}
Community
mpen
1

The top-rated note at http://php.net/manual/en/ref.mbstring.php#109937 says str_replace works for multibyte strings.

RobC
Shaunak Sontakke
  • Ah - remember to check the comment in its entirety and not just the first phrase. Quote: `Problems arise if you are getting a value "from outside" somewhere (database, POST request) and the encoding of the needle and the haystack is not the same.` So it's only in a specific use case, in a "perfect world" type example, that str_replace works with multibyte strings. – Frits Oct 14 '19 at 07:32
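
The mismatch described in that quote can be illustrated with a small sketch. It assumes a UTF-8 haystack and a needle that arrived as ISO-8859-1 (both encodings are chosen here purely for illustration), and uses `mb_convert_encoding` to bring the needle into the haystack's encoding before calling `str_replace`.

```php
<?php
$haystack = "caf\xC3\xA9"; // "café" encoded as UTF-8
$needle   = "\xE9";        // "é" encoded as ISO-8859-1 (assumed source encoding)

// The byte 0xE9 does not occur in the UTF-8 haystack, so nothing is replaced.
$unchanged = str_replace($needle, 'e', $haystack);

// Convert the needle to the haystack's encoding first, then replace.
$needleUtf8 = mb_convert_encoding($needle, 'UTF-8', 'ISO-8859-1');
$replaced   = str_replace($needleUtf8, 'e', $haystack);
```

This only works when the source encoding of the outside value is actually known; guessing it is a separate problem.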