I'm using this PHP function to build SEO URLs. It works fine with Latin words, but my URLs are in Cyrillic. The regex /[^a-z0-9_\s-]/ strips every Cyrillic character. Please help me make it work with non-Latin characters.

function seoUrl($string) {
    // Lower case everything
    $string = strtolower($string);
    // Make alphanumeric (removes all other characters)
    $string = preg_replace('/[^a-z0-9_\s-]/', '', $string);
    // Clean up multiple dashes or whitespaces
    $string = preg_replace('/[\s-]+/', ' ', $string);
    // Convert whitespaces and underscore to dash
    $string = preg_replace('/[\s_]/', '-', $string);
    return $string;
}
Ωmega
Vesselina Taneva

2 Answers

You need a Unicode script property for the Cyrillic alphabet, which PHP's PCRE fortunately supports as \p{Cyrillic}. You also have to set the u (Unicode) flag so that the pattern and the subject string are treated as UTF-8 rather than as single bytes. You may also want the i flag for case-insensitive matching, so that A-Z is kept as well:

~[^\p{Cyrillic}a-z0-9_\s-]~ui

You don't need to double escape \s.

PHP code:

preg_replace('~[^\p{Cyrillic}a-z0-9_\s-]+~ui', '', $string);
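
As a quick check, here is a minimal sketch of that call applied to a mixed Cyrillic/Latin string. The sample input and the variable names are made up for illustration, and it assumes the string is UTF-8 encoded:

```php
<?php
// Hypothetical sample input (assumed to be valid UTF-8).
$string = 'Привет, Мир! Hello_World 2024';

// Remove everything that is not a Cyrillic letter, a Latin letter (any case,
// because of the i flag), a digit, an underscore, whitespace or a dash.
$clean = preg_replace('~[^\p{Cyrillic}a-z0-9_\s-]+~ui', '', $string);

echo $clean, PHP_EOL; // Привет Мир Hello_World 2024
```

Only the comma and the exclamation mark are removed; the Cyrillic letters survive because of \p{Cyrillic} together with the u flag.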
Ωmega
revo

To learn more about Unicode Regular Expressions see this article.

\p{L} or \p{Letter} matches any kind of letter from any language.

To match only Cyrillic characters, use \p{Cyrillic}
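
To illustrate the difference between the two properties, here is a small sketch that counts matches for each in a few sample strings. The strings and variable names are assumptions chosen for the example, and the input is assumed to be UTF-8:

```php
<?php
// Hypothetical sample inputs (assumed to be valid UTF-8).
$samples = ['Привет мир', 'Γεια σου', 'hello world'];

$counts = [];
foreach ($samples as $s) {
    $counts[$s] = [
        // \p{L} matches a letter from any script.
        'letters'  => preg_match_all('/\p{L}/u', $s),
        // \p{Cyrillic} matches only characters of the Cyrillic script.
        'cyrillic' => preg_match_all('/\p{Cyrillic}/u', $s),
    ];
    echo $s, ': letters=', $counts[$s]['letters'],
         ', cyrillic=', $counts[$s]['cyrillic'], PHP_EOL;
}
```

The Greek and Latin samples contain letters but no Cyrillic characters, so \p{L} is the one to use if the URLs may contain other alphabets.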

Since Cyrillic characters are not ASCII characters, you have to use the u flag/modifier so that the regex engine treats the pattern and the input as UTF-8 and recognizes Unicode characters.

Be sure to use mb_strtolower instead of strtolower, since you are working with Unicode characters. strtolower operates on single bytes and does not reliably lower-case multibyte letters.
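
A minimal sketch of the multibyte call, assuming the mbstring extension is loaded and the input is UTF-8 (the sample string and variable names are invented for the example):

```php
<?php
// Hypothetical sample input (assumed to be valid UTF-8).
$title = 'ПРИВЕТ Мир';

// Passing the encoding explicitly avoids depending on mb_internal_encoding().
$lower = mb_strtolower($title, 'UTF-8');

echo $lower, PHP_EOL; // привет мир
```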

Because you convert all characters to lowercase first, you don't need the i regex flag/modifier.


The following PHP code should work for you:

function seoUrl($string) {
    // Lower-case everything; mb_strtolower handles multibyte (UTF-8) letters
    $string = mb_strtolower($string, 'UTF-8');
    // Keep only Cyrillic letters, a-z, digits, whitespace, underscores and dashes
    $string = preg_replace('/[^\p{Cyrillic}a-z0-9\s_-]+/u', '', $string);
    // Collapse runs of dashes or whitespace into a single space
    // (u flag so multibyte characters are never matched byte by byte)
    $string = preg_replace('/[\s-]+/u', ' ', $string);
    // Convert whitespace and underscores to dashes
    $string = preg_replace('/[\s_]/u', '-', $string);
    return $string;
}

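Furthermore, note that some other regex engines (Perl and Java, for example) also offer Unicode block properties such as \p{InCyrillic} (the basic Cyrillic block) and \p{InCyrillic_Supplementary} (the Cyrillic Supplement block). As far as I know, PCRE, which PHP uses, does not support these block names, so stick with the \p{Cyrillic} script property, which already covers letters from both blocks.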

Ωmega