replace all language spaces with standardized space

Question

I'm working on a simple search input. It splits terms on spaces, which works nicely. However, it does not recognize the space characters used in other languages.

I'd like to use preg_replace to convert other languages' spaces to a standardized space.

Example:

$pattern       = array(
   //insert other language space codes here (I don't know what they are or how to find them) 
);
$replacement   = ' ';
$string        = "日本語　の　スペース　です";

$cleaned = preg_replace($pattern, $replacement, $string);

Did you try `preg_replace('/\s/', ' ', $string)`? Maybe Regex will catch other language spaces — sjagr, Nov 10 '14 at 17:56
@sjagr unfortunately it didn't catch it. It will catch the space if I type in the specific space. Which I will probably do in the meantime. — Trevor Wood, Nov 10 '14 at 18:00

score 2 · Accepted Answer · edited Jun 20 '20 at 09:12

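Use the u modifier in your pattern along with the \s escape sequence, which will then match any Unicode whitespace character, including the ideographic space (U+3000) in your string. Using your code, here with an empty replacement so the spaces are removed (use ' ' as the replacement to normalize them instead):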

$pattern     = '/\s/u';
$replacement = '';
$string      = "日本語　の　スペース　です";

$cleaned = preg_replace($pattern, $replacement, $string);

var_dump($cleaned);

Output:

string(30) "日本語のスペースです"
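
Since the question asks for a standardized space rather than removal, here is a minimal sketch of that variant. The function name normalize_spaces is hypothetical, and collapsing runs of whitespace with + and trimming the result are assumptions about what a search input would want, not part of the original answer:

```php
<?php
// Hypothetical helper: replace any Unicode whitespace with a single ASCII space.
function normalize_spaces(string $input): string
{
    // With the u modifier, \s also matches characters such as U+3000
    // (ideographic space). The + collapses consecutive whitespace.
    $normalized = preg_replace('/\s+/u', ' ', $input);

    // preg_replace returns null on error, e.g. for an invalid UTF-8 subject.
    if ($normalized === null) {
        return '';
    }

    // After normalization, any leading/trailing whitespace is a plain space.
    return trim($normalized);
}

$string = "日本語　の　スペース　です";

// Split the normalized string into search terms.
$terms = explode(' ', normalize_spaces($string));

var_dump($terms);
```

If only the list of terms is needed, preg_split('/\s+/u', $string, -1, PREG_SPLIT_NO_EMPTY) should give the same terms in one step.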

From the manual:

u (PCRE_UTF8)

This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern and subject strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern and the subject is checked since PHP 4.3.5. An invalid subject will cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have been regarded as valid UTF-8.
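
Because an invalid UTF-8 subject makes the call fail rather than match, it may be worth checking for that case explicitly. The sketch below assumes a current PHP version, where preg_replace returns null on such an error and preg_last_error() reports PREG_BAD_UTF8_ERROR; the sample input is made up for illustration:

```php
<?php
// Made-up input: the byte 0xFF can never appear in valid UTF-8.
$invalid = "abc\xFFdef";

$result = preg_replace('/\s/u', ' ', $invalid);

if ($result === null && preg_last_error() === PREG_BAD_UTF8_ERROR) {
    // Reject the input here, or convert its encoding before retrying.
    echo "Subject is not valid UTF-8\n";
}
```

Without this check, a null result could be silently treated as an empty search string.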
