Why does PHP's trim() ignore certain "kinds" of whitespace?

Question

I just wasted hours this morning with endless experiments after my trim() wrapper function seemed not to delete the annoying space at the end of an input string.

Turns out it wasn't a normal " " space, but a "NO-BREAK SPACE", which looks identical but is apparently different and entirely "blind" to trim().

I've found numerous "solutions", all feeling "wrong", but while I'm sure I could hack together something to solve the immediate problem, that's not my concern. My concern is why exactly PHP's trimmer doesn't trim "NO-BREAK SPACE" characters. It always seems like the built-in functions or features in PHP have some serious issue that makes them useless to rely on in practice, and thus I have to make some hack of my own.

It's strange to me that the PHP developers would not have thought of this situation and simply made trim() able to understand this (and all other) whitespaces.

How come they seem to not "upgrade" ancient PHP functions but just let them rot and have every individual developer find their own messy, fragile solution?

I mean, to point out the obvious PHP is OSS, feel free to suggest it to the core devs and / or implement a patch for consideration. — Jonnix, Jul 17 '20 at 16:58
If you see https://www.php.net/manual/en/function.trim.php, you'll see what it's designed to trim. The no-break space is not one of those characters. If you think it should be added in, you can request it. — aynber, Jul 17 '20 at 17:04
Backwards compatibility. If they change the default value for the second argument every couple of years then people will yell "why do you break my code?". You can't accomplish contradictory things. — Álvaro González, Jul 17 '20 at 17:07
Since this question is tagged by "Unicode", let me remind you that the `trim` function is NOT Unicode (UTF-8) compatible/compliant/aware. — julp, Jul 17 '20 at 17:46

Markus AO · Answer 1 · 2020-07-24T19:56:27.393

2

Edited & expanded version of this answer at: Multibyte trim in PHP?

PHP trim ignores non-breaking spaces, because the source code for trim says so:

(c == ' ' || c == '\n' || c == '\r' || c == '\t' || c == '\v' || c == '\0')

Aside from ' ', the regular space (32), these are all ASCII control characters: 10, 13, 9, 11, 0. NBSP isn't part of ASCII, and trim() isn't a Unicode-aware function. Do note that the line above is quoted directly from PHP's C source code for trim; in PHP itself, escape sequences are only interpreted inside double quotes ("\n", "\t" etc.), otherwise they are parsed as a literal backslash followed by a letter.
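A small sketch of both points, assuming UTF-8 input and PHP 7.0+ (for the "\u{...}" escape); the variable names are only illustrative:

```php
<?php
// NBSP (U+00A0) is encoded as the two bytes 0xC2 0xA0 in UTF-8.
$nbsp  = "\u{00A0}";
$input = "hello" . $nbsp;

// trim() leaves the NBSP alone: neither byte is in its default character list.
var_dump(trim($input) === $input);       // bool(true)

// An ordinary trailing space and linefeed are stripped as expected.
var_dump(trim("hello \n") === "hello");  // bool(true)

// Escape sequences are only interpreted inside double quotes.
var_dump(strlen("\n"));                  // int(1): a single linefeed byte
var_dump(strlen('\n'));                  // int(2): a backslash followed by "n"
```

Passing the NBSP as trim()'s second argument is not a safe workaround in general: the character list is treated as individual bytes, so 0xC2 and 0xA0 would each be stripped on their own and could cut into other multibyte characters.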

Should PHP developers add Unicode support for trim()? Should they add Unicode support for any number of Unicode-unsafe string functions? Or create more mb_ variants (that are obviously slower)? These are the perennial debates of the wizards. Marvels pending, you're free to implement your own Unicode-trim functions for possible non-ASCII input cases:

preg_replace('~^\s+~u', '', $string) // == ltrim
preg_replace('~\s+$~u', '', $string) // == rtrim
preg_replace('~^\s+|\s+$~u', '', $string) // == trim
// preg_replace('~^\s+(.+?)?\s+$~us', '\1', $string) // edit: a redundant capture-group!

These are only rudimentary examples. Obviously, none of them will trim anything that's still an HTML entity of any description. Per the PCRE specification, the \s space regex in u (Unicode) mode should match the following spaces. (It's not only NBSP that lurks in our unnormalized strings!)

The horizontal space characters are:

U+0009     Horizontal tab (HT)
U+0020     Space
U+00A0     Non-break space
U+1680     Ogham space mark
U+180E     Mongolian vowel separator
U+2000     En quad
U+2001     Em quad
U+2002     En space
U+2003     Em space
U+2004     Three-per-em space
U+2005     Four-per-em space
U+2006     Six-per-em space
U+2007     Figure space
U+2008     Punctuation space
U+2009     Thin space
U+200A     Hair space
U+202F     Narrow no-break space
U+205F     Medium mathematical space
U+3000     Ideographic space

The vertical space characters are:

U+000A     Linefeed (LF)
U+000B     Vertical tab (VT)
U+000C     Form feed (FF)
U+000D     Carriage return (CR)
U+0085     Next line (NEL)
U+2028     Line separator
U+2029     Paragraph separator
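To tie these lists back to the patterns above, here is a minimal sketch of a reusable wrapper. The name unicode_trim is made up for illustration and is not a PHP built-in; the sketch assumes UTF-8 input and a PCRE build where \s with the u modifier covers the characters listed.

```php
<?php
// Hypothetical helper (not part of PHP): trims Unicode whitespace at both ends.
function unicode_trim(string $string): string
{
    // With the u modifier, \s is matched against Unicode whitespace.
    $result = preg_replace('~^\s+|\s+$~u', '', $string);

    // preg_replace() returns null on failure (for example, invalid UTF-8),
    // so fall back to the untouched input in that case.
    return $result ?? $string;
}

// NBSP and a space in front, an ideographic space and a linefeed at the end.
var_dump(unicode_trim("\u{00A0} hello world\u{3000}\n") === "hello world"); // bool(true)

// Whitespace inside the string is left alone.
var_dump(unicode_trim("a\u{00A0}b") === "a\u{00A0}b");                       // bool(true)
```

Falling back to the original string on error is just one possible design choice; throwing an exception would be equally reasonable. On PHP 8.4 and later, the mbstring extension also offers mb_trim(), which may make a custom helper unnecessary.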

You may struggle with the NBSP, but did I ever tell you how I once wasted a season trying to trim strings with Mongolian vowel separators until I saw the light of Unicode? There are obviously more educated and elaborate Unicode white-space trimming efforts out there.

Edit: You can see a test iteration of how preg_replace with the u Unicode flag handles the spaces above. (They are all trimmed as expected, following the PCRE spec above.)
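A test iteration of this kind might look roughly like the sketch below. It is not the original linked test, just an illustration; the array keys are labels copied from the lists above, and the result for each character depends on the PCRE version PHP was built with.

```php
<?php
// Every character from the horizontal and vertical lists, keyed by code point.
$spaces = [
    'U+0009' => "\u{0009}", 'U+0020' => "\u{0020}", 'U+00A0' => "\u{00A0}",
    'U+1680' => "\u{1680}", 'U+180E' => "\u{180E}", 'U+2000' => "\u{2000}",
    'U+2001' => "\u{2001}", 'U+2002' => "\u{2002}", 'U+2003' => "\u{2003}",
    'U+2004' => "\u{2004}", 'U+2005' => "\u{2005}", 'U+2006' => "\u{2006}",
    'U+2007' => "\u{2007}", 'U+2008' => "\u{2008}", 'U+2009' => "\u{2009}",
    'U+200A' => "\u{200A}", 'U+202F' => "\u{202F}", 'U+205F' => "\u{205F}",
    'U+3000' => "\u{3000}", 'U+000A' => "\u{000A}", 'U+000B' => "\u{000B}",
    'U+000C' => "\u{000C}", 'U+000D' => "\u{000D}", 'U+0085' => "\u{0085}",
    'U+2028' => "\u{2028}", 'U+2029' => "\u{2029}",
];

$results = [];
foreach ($spaces as $label => $space) {
    // Pad a one-letter string on both sides, then try to trim it back to "x".
    $padded  = $space . 'x' . $space;
    $trimmed = preg_replace('~^\s+|\s+$~u', '', $padded);

    $results[$label] = ($trimmed === 'x');
    printf("%s: %s\n", $label, $results[$label] ? 'trimmed' : 'NOT trimmed');
}
```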

In any case, the question here wasn't "how", it was "why". The short of the why is, because if they added in the Non-breaking space, they'd also have to add the Medium mathematical space, the Ogham space mark, and the Mongolian vowel separator, among others.

edited Jul 24 '20 at 19:56

answered Jul 17 '20 at 20:43

Markus AO


I have closed this page with a much earlier page that offers better resolving patterns. I don't like these: `preg_replace('~^\s*~u', '', $string) // == ltrim preg_replace('~\s*$~u', '', $string) // == rtrim` 1. they will replace nothing with nothing (a waste of processing effort) and 2. they can be merged into a single pattern. (see the dupe) – mickmackusa Jul 17 '20 at 22:04
`'\n'` is not the same as `"\n"`. That first double-pipe delimited battery of sample whitespace characters is misleading. – mickmackusa Jul 17 '20 at 22:05
I don't like `preg_replace('~^\s*(.+)\s*$~u', '\1', $string) // == trim` because 1. It will replace nothing with nothing, 2. It will not replace a solitary space (proof: https://3v4l.org/EQs8Y) and 3. It will not trim whitespaces if there is a linebreak in the middle of the string (proof: https://3v4l.org/8k8cv) – mickmackusa Jul 17 '20 at 22:09

I can see that you have spent some time crafting this incorrect, misleading, and suboptimal answer. Normally I would promise to remove my DV if the post is corrected, but because this question is a duplicate of several questions already posted on SO, I will not be removing my vote. Please make a habit of closing close-able questions so that SO has less redundant content. When you answer, make sure that your suggestions work and are correct so that you do not poison the SO knowledge pool. – mickmackusa Jul 17 '20 at 22:11
@mickmackusa thanks for your feedback. Yes, the `ltrim` and `rtrim` can be merged into a single pattern, which would then be `trim`, not `ltrim` or `rtrim`. As for l/rtrim regex "replace nothing with nothing", I'm under the impression that no action is taken if no match happens, do correct me if I'm wrong. The `trim` will indeed fail with a multiline subject because `s` dotall isn't on, and also because the capture group is greedy. I've now fixed it. As stated, the regexes are rudimentary examples, with better solutions linked. – Markus AO Jul 18 '20 at 14:50
This is primarily an answer to **"why"** (per the OP's question), not "how to" (ergo, the "why" explained at more length). The list of white-space characters is a direct quote from the PHP source code for `trim` (linked in the answer), clarification added. The notes on PCRE white-space character spec and the issues surrounding PHP's multibyte support are, as far as I can tell, useful and relevant information. If an answer has a code specimen that fails in some cases, and if that makes the post wholesale "incorrect, misleading, and suboptimal", I won't lose sleep over the downvote. :) – Markus AO Jul 18 '20 at 14:57
@mickmackusa all it took was `?` into the `trim` equivalent to eliminate the space. `preg_replace('~^\s*(.+?)?\s*$~us', '\1', $string)`. In any case, the capture-group was redundant, when `|` would do for trim-only, so edited. I've also changed `*` to `+`, for whatever it's worth. There, should be bug-free, and I've added in a unit test iterating the PCRE white-space list to demonstrate that simply `\s` will really do the job. N.B. The thread you linked as the answer contains no specific information on the range of possible white-space characters that may be encountered. – Markus AO Jul 18 '20 at 17:24
this page is very likely to be purged. If you would like your effort to be preserved, I recommend that you find the earliest Stack Overflow page that requests the guidance included in this post and write a new answer on the older page. Please ensure that when you transfer your post that you are not generating redundant advice on the page (only unique content is valuable to researchers; duplicated advice wastes researcher time). You might find a suitable page in the list of duplicates that I have closed this page with. Please also read my commented link to Meta under the question. – mickmackusa Jul 19 '20 at 00:40

@mickmackusa thanks for the heads-up. Read the Meta discussion you linked, point taken. I've ported an edited and expanded version of my answer to: https://stackoverflow.com/questions/10066647/multibyte-trim-in-php/62983015#62983015 – Markus AO Jul 19 '20 at 16:45
