PHP trim ignores non-breaking spaces, because the source code for trim says so:
(c == ' ' || c == '\n' || c == '\r' || c == '\t' || c == '\v' || c == '\0')
Aside from ' ', the regular space (32), these are all ASCII control characters: 10, 13, 9, 11, 0. NBSP isn't included in ASCII, and trim() isn't a Unicode-safe function. Do note that the above is quoted directly from PHP's source code for trim(); in PHP itself, you'd have to double-quote the escape sequences ("\n", "\t", etc.), otherwise they will be parsed as literal backslash-and-letter pairs.
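A quick way to see both points for yourself. This is only a sketch; it assumes PHP 7.0+ for the "\u{...}" escape, and the variable names are mine:

```php
<?php
// NBSP is U+00A0, encoded in UTF-8 as the two bytes 0xC2 0xA0.
$nbsp   = "\u{00A0}";
$padded = $nbsp . "hello" . $nbsp;

// trim() with its default character list leaves the NBSPs alone.
var_dump(trim($padded) === $padded);        // bool(true)

// ASCII white-space, on the other hand, is stripped as usual.
var_dump(trim(" \t hello \n") === "hello"); // bool(true)

// Escape sequences are only interpreted inside double quotes.
var_dump(strlen("\n")); // int(1): a single linefeed
var_dump(strlen('\n')); // int(2): a backslash followed by the letter n
```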
Should PHP developers add Unicode support for trim()? Should they add Unicode support for any number of Unicode-unsafe string functions? Or create more mb_ variants (that are obviously slower)? These are the perennial debates of the wizards. Marvels pending, you're free to implement your own Unicode-trim functions for possible non-ASCII input cases:
preg_replace('~^\s+~u', '', $string);      // == ltrim
preg_replace('~\s+$~u', '', $string);      // == rtrim
preg_replace('~^\s+|\s+$~u', '', $string); // == trim (no capture group, so the replacement is simply '')
// preg_replace('~^\s+(.+?)?\s+$~us', '\1', $string) // edit: the earlier version, with a redundant capture-group!
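If you want that as a reusable function, something like the following might do. It's a sketch only: unicode_trim() is a made-up name, not a PHP built-in, and the fallback to trim() is just one possible choice. The one thing worth knowing is that preg_replace() returns null when the u flag meets a string that isn't valid UTF-8.

```php
<?php
// Hypothetical helper (not part of PHP): trims Unicode white-space from both ends.
function unicode_trim(string $string): string
{
    // With the u flag, \s also matches non-ASCII spaces such as U+00A0.
    $trimmed = preg_replace('~^\s+|\s+$~u', '', $string);

    // preg_replace() returns null on failure, e.g. when $string is not valid UTF-8.
    // Falling back to the byte-oriented built-in is an assumption of this sketch.
    return $trimmed ?? trim($string);
}

$nbsp = "\u{00A0}";
var_dump(unicode_trim($nbsp . " caf\u{00E9} " . $nbsp)); // string(5) "café"
```

Whether to fall back, throw, or return the input untouched on invalid UTF-8 depends on where your strings come from.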
These are just examples. Obviously, none of this will trim anything that's still an HTML entity of any description. Per the PCRE specification, the \s white-space class in u (Unicode) mode should match the following spaces. (It's not only NBSP that lurks in our unnormalized strings!)
The horizontal space characters are:
U+0009 Horizontal tab (HT)
U+0020 Space
U+00A0 Non-break space
U+1680 Ogham space mark
U+180E Mongolian vowel separator
U+2000 En quad
U+2001 Em quad
U+2002 En space
U+2003 Em space
U+2004 Three-per-em space
U+2005 Four-per-em space
U+2006 Six-per-em space
U+2007 Figure space
U+2008 Punctuation space
U+2009 Thin space
U+200A Hair space
U+202F Narrow no-break space
U+205F Medium mathematical space
U+3000 Ideographic space
The vertical space characters are:
U+000A Linefeed (LF)
U+000B Vertical tab (VT)
U+000C Form feed (FF)
U+000D Carriage return (CR)
U+0085 Next line (NEL)
U+2028 Line separator
U+2029 Paragraph separator
You may struggle with the NBSP, but did I ever tell you how I once wasted a season trying to trim strings with Mongolian vowel separators, until I saw the light of Unicode? There are obviously more educated and elaborate Unicode white-space trimming efforts out there.
Edit: You can see a test iteration of how preg_replace with the u Unicode flag handles the spaces above. (They are all trimmed as expected, following the PCRE spec above.)
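A rough sketch of such a test iteration, assuming PHP 7.0+ for the "\u{...}" escapes. The array covers only a selection of the code points listed above, and the labels and variable names are mine; extend it with the rest if you're curious.

```php
<?php
// A selection of the spaces listed above, keyed by code point label.
$spaces = [
    'U+0009' => "\u{0009}", 'U+0020' => "\u{0020}", 'U+00A0' => "\u{00A0}",
    'U+1680' => "\u{1680}", 'U+2000' => "\u{2000}", 'U+2003' => "\u{2003}",
    'U+2009' => "\u{2009}", 'U+202F' => "\u{202F}", 'U+205F' => "\u{205F}",
    'U+3000' => "\u{3000}", 'U+000A' => "\u{000A}", 'U+000D' => "\u{000D}",
    'U+2028' => "\u{2028}", 'U+2029' => "\u{2029}",
];

$results = [];
foreach ($spaces as $label => $space) {
    // Wrap a visible character in the space under test, then trim both ends.
    $trimmed = preg_replace('~^\s+|\s+$~u', '', $space . 'x' . $space);
    $results[$label] = ($trimmed === 'x');
    printf("%s %s\n", $label, $results[$label] ? 'trimmed' : 'NOT trimmed');
}
```

Each line of output should report "trimmed" if your PHP build's PCRE behaves as the spec above describes.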
In any case, the question here wasn't "how", it was "why". The short of the why is that if they added the non-breaking space, they'd also have to add the medium mathematical space, the Ogham space mark, and the Mongolian vowel separator, among others.