I'm having trouble matching/replacing the zero-width space (ZWSP) when it is encoded as UTF-8.
ZWSP: \x20\x0B
ZWSP (UTF-8): \xE2\x80\x8B
As an extra test case I used NBSP (non-breaking space), which works as expected.
All preg_replace calls are in UTF-8 mode (the /u modifier).
When matching NBSP it works as expected: the input is encoded as UTF-8 and the output is empty (the NBSP is replaced with an empty string).
When matching ZWSP, it only works if the ZWSP input is not UTF-8 encoded.
If I change the ZWSP pattern to the UTF-8 encoded version and keep the input as UTF-8, it doesn't work either.
Q: How do I match ZWSP in UTF-8 input, then?
... or is this a bug?
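One way to narrow this down is to check what each escape sequence actually denotes once PCRE sees it. The following is a minimal sketch, not a confirmed diagnosis. It assumes PHP 7+ (for the "\u{...}" string syntax) and relies on the fact that in a single-quoted pattern PCRE, not PHP, interprets \xhh, and that under /u PCRE treats \xhh as the code point U+00hh rather than as a raw byte. The variable names are illustrative only.

```php
<?php
$nbspInput = "\xC2\xA0";      // U+00A0 encoded as UTF-8
$zwspInput = "\xE2\x80\x8B";  // U+200B encoded as UTF-8

// Under /u, \xA0 is the code point U+00A0, so it matches the two-byte input.
$nbspHit = preg_match('/\xA0/u', $nbspInput);

// \x20\x0B is two separate characters: space (U+0020) and vertical tab (U+000B).
$spaceVtHit    = preg_match('/\x20\x0B/u', " \x0B");
$spaceVtOnZwsp = preg_match('/\x20\x0B/u', $zwspInput);

// Under /u, \xE2\x80\x8B is three code points: U+00E2, U+0080, U+008B.
$threeCpOnZwsp  = preg_match('/\xE2\x80\x8B/u', $zwspInput);
$threeCpOnLatin = preg_match('/\xE2\x80\x8B/u', "\u{E2}\u{80}\u{8B}");

var_dump($nbspHit, $spaceVtHit, $spaceVtOnZwsp, $threeCpOnZwsp, $threeCpOnLatin);
```

If the last two results differ, the pattern is being read as three separate Latin-1-range characters rather than as the UTF-8 bytes of a single U+200B.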
Code
$nbsp = '\xA0'; // Non-breaking space
$zwsp = '\x20\x0B'; // Zero-width space
$zwsp_utf8 = '\xE2\x80\x8B';
$input_nbsp_utf8 = "\xC2\xA0";
$input_zwsp = "\x20\x0B";
$input_zwsp_utf8 = "\xE2\x80\x8B";
// NBSP
echo "NBSP\n-----\n";
echo "in: $input_nbsp_utf8--\nhex: ".bin2hex($input_nbsp_utf8)."\n";
$output = preg_replace('/'.$nbsp.'/u', '', $input_nbsp_utf8);
echo "out: $output--\nhex: ".bin2hex($output)."\n\n";
// ZWSP (input: **not** UTF8)
echo "ZWSP (input: **not** UTF8)\n-----\n";
echo "in: $input_zwsp--\nhex: ".bin2hex($input_zwsp)."\n";
$output = preg_replace('/'.$zwsp.'/u', '', $input_zwsp);
echo "out: $output--\nhex: ".bin2hex($output)."\n\n";
// ZWSP (input: UTF8)
echo "ZWSP (input: UTF8)\n-----\n";
echo "in: $input_zwsp_utf8--\nhex: ".bin2hex($input_zwsp_utf8)."\n";
$output = preg_replace('/'.$zwsp.'/u', '', $input_zwsp_utf8);
echo "out: $output--\nhex: ".bin2hex($output)."\n\n";
// ZWSP (pattern: UTF8, input: UTF8)
echo "ZWSP (pattern: UTF8, input: UTF8)\n-----\n";
echo "in: $input_zwsp_utf8--\nhex: ".bin2hex($input_zwsp_utf8)."\n";
$output = preg_replace('/'.$zwsp_utf8.'/u', '', $input_zwsp_utf8);
echo "out: $output--\nhex: ".bin2hex($output)."\n\n";
Output
NBSP
-----
in: --
hex: c2a0
out: --
hex:
ZWSP (input: **not** UTF8)
-----
in:
--
hex: 200b
out: --
hex:
ZWSP (input: UTF8)
-----
in: --
hex: e2808b
out: --
hex: e2808b // Output should be empty
ZWSP (pattern: UTF8, input: UTF8)
-----
in: --
hex: e2808b
out: --
hex: e2808b // Output should be empty
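For comparison, here is a minimal sketch of two patterns that should remove a UTF-8 encoded ZWSP, assuming the goal is to strip U+200B from a valid UTF-8 string. The first uses PCRE's \x{...} code point escape in a single-quoted pattern. The second uses a double-quoted PHP string, so PHP itself converts the escapes to raw bytes and the pattern contains the literal character. The sample input "foo", ZWSP, "bar" is made up for illustration.

```php
<?php
$input = "foo\xE2\x80\x8Bbar"; // "foo" + U+200B + "bar" as UTF-8 bytes

// Single quotes: PCRE receives \x{200B} and reads it as code point U+200B.
$byCodePoint = preg_replace('/\x{200B}/u', '', $input);

// Double quotes: PHP expands the escapes, so the pattern holds the real UTF-8 bytes.
$byLiteral = preg_replace("/\xE2\x80\x8B/u", '', $input);

echo "in:  ", bin2hex($input), "\n";        // 666f6fe2808b626172
echo "out: ", bin2hex($byCodePoint), "\n";  // 666f6f626172
echo "out: ", bin2hex($byLiteral), "\n";    // 666f6f626172
```

The \x{200B} form is probably the easier one to read in a pattern, since it names the code point directly and does not depend on the quoting style of the surrounding PHP string.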