
I'm trying to trim Unicode whitespace such as these characters, and I was able to do that using this solution. The problem is that it doesn't remove the Unicode whitespace IN BETWEEN normal characters. For example, with this string that uses Thin Space (U+2009):

$string = "   test   string   ";
echo preg_replace('/^[\pZ\pC]+|[\pZ\pC]+$/u', '', $string);
// outputs: test   string

I have only a limited understanding of regex, so I don't know what to change in my expression to resolve this issue.

saionachi

2 Answers


Unicode whitespace characters such as \u{2009} cause problems in various places. I would therefore replace all Unicode whitespace with regular spaces and then apply trim().

$string = "   test   string and XY \t ";
//\u{2009}\u{2009}\u{2009}test\u{2009}\u{2009}\u{2009}string\u{2009}and\x20XY\x20\x09\u{2009}

$trimString = trim(preg_replace('/[\pZ\pC]/u', ' ', $string));
//test\x20\x20\x20string\x20and\x20XY

Note: The string representations in the comments were produced with debug::writeUni($string, $trimString); from this class.
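The replace-then-trim approach above can be wrapped in a small helper. This is only a sketch: the function name normalizeSpaces is made up for illustration, and the input is written with \u{2009} escapes (PHP 7+) so the thin spaces are visible in the source.

```php
<?php
// Hypothetical helper name; not part of the original answer.
function normalizeSpaces(string $s): string
{
    // Replace every separator (\pZ) or "other" (\pC, which covers control
    // and format chars) character with a regular space.
    $replaced = preg_replace('/[\pZ\pC]/u', ' ', $s);

    // preg_replace() returns null on failure (e.g. invalid UTF-8 with /u),
    // so fall back to the untouched input in that case.
    return trim($replaced ?? $s);
}

$string = "\u{2009}\u{2009}test\u{2009}\u{2009}\u{2009}string\u{2009}and XY \t\u{2009}";
var_dump(normalizeSpaces($string) === "test   string and XY"); // bool(true)
```

Note that this keeps the inner runs as regular spaces (three between "test" and "string" here); it does not collapse them.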

jspit

To remove all Unicode whitespace with control chars at the start and end of string, and remove all Unicode whitespace with control chars other than regular space anywhere inside the string, you can use

preg_replace('/^[\pZ\pC]+|[\pZ\pC]+$|(?! )[\pZ\pC]/u', '', $string)
// Or, simply
preg_replace('/^\s+|\s+$|[^\S ]/u', '', $string)

See the regex demo #1 and regex demo #2.

Details

  • ^[\pZ\pC]+ - one or more whitespace or control chars at the start of string
  • | - or
  • [\pZ\pC]+$ - one or more whitespace or control chars at the end of string
  • | - or
  • (?! )[\pZ\pC] - a single whitespace or control char other than a regular space, anywhere inside the string
  • [^\S ] - any whitespace other than a regular space (\x20)

If you need to "exclude" common line break chars, too, replace (?! )[\pZ\pC] with (?![ \r\n])[\pZ\pC] (as suggested by @MonkeyZeus). In the second regex, that means using [^\S \r\n] instead of [^\S ].
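As a rough sketch of how the two variants behave, the helper below switches between the plain pattern and the line-break-preserving one. The function name stripUnicodeSpaces and its $keepLineBreaks flag are assumptions made for this example, and the test string uses \u{2009} escapes (PHP 7+) for the thin spaces.

```php
<?php
// Hypothetical helper; the name and the $keepLineBreaks flag are illustrative.
function stripUnicodeSpaces(string $s, bool $keepLineBreaks = false): string
{
    // Inner alternative: one \pZ/\pC char that is not a regular space
    // (and, optionally, not CR or LF either).
    $inner = $keepLineBreaks ? '(?![ \r\n])[\pZ\pC]' : '(?! )[\pZ\pC]';

    // Leading and trailing runs are always removed, line breaks included.
    $pattern = '/^[\pZ\pC]+|[\pZ\pC]+$|' . $inner . '/u';

    // preg_replace() returns null on failure; fall back to the input then.
    return preg_replace($pattern, '', $s) ?? $s;
}

$input = "\u{2009}abc def\u{2009}ghi\r\njkl\u{2009}";
var_dump(stripUnicodeSpaces($input) === "abc defghijkl");           // bool(true)
var_dump(stripUnicodeSpaces($input, true) === "abc defghi\r\njkl"); // bool(true)
```

Without the flag, \r and \n are removed as well, because both belong to \pC.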

See PHP demo:

echo preg_replace('~^[\pZ\pC]+|[\pZ\pC]+$|(?! )[\pZ\pC]~u', '', 'abc def ghi      ');
// => abc defghi
echo preg_replace('/^\s+|\s+$|[^\S ]/u', '', 'abc def ghi     ');
// => abc defghi
Wiktor Stribiżew
  • It's important to note that `\s` is only a subset of `[\pZ\pC]`. In particular it doesn't include zero-width space and zero width non-breaking space, both notorious characters that you probably want to remove. See [this page](https://en.wikipedia.org/wiki/Whitespace_character#Unicode) for reference. – Paul Mar 13 '23 at 23:07
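
To illustrate the point in the comment above, here is a minimal check of how the two patterns from this answer treat a zero-width space (U+200B, general category Cf). The sample string is made up for the example, and the \u{...} escapes assume PHP 7+.

```php
<?php
// "abc" and "def" separated by ZERO WIDTH SPACE (U+200B).
$s = "abc\u{200B}def";

// \s-based pattern: U+200B is not whitespace for \s, so nothing is removed.
$viaS = preg_replace('/^\s+|\s+$|[^\S ]/u', '', $s);

// \pZ/\pC-based pattern: U+200B is in \pC (category Cf), so it is removed.
$viaProps = preg_replace('/^[\pZ\pC]+|[\pZ\pC]+$|(?! )[\pZ\pC]/u', '', $s);

var_dump($viaS === $s);            // bool(true)
var_dump($viaProps === "abcdef");  // bool(true)
```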