Multibyte trim in PHP?

Question

Apparently there's no mb_trim in the mb_* family, so I'm trying to implement one on my own.

I recently found this regex in a comment on php.net:

/(^\s+)|(\s+$)/u

So, I'd implement it in the following way:

function multibyte_trim($str)
{
    if (!function_exists("mb_trim") || !extension_loaded("mbstring")) {
        return preg_replace("/(^\s+)|(\s+$)/u", "", $str);
    } else {
        return mb_trim($str);
    }
}

The regex seems correct to me, but I'm extremely noob with regular expressions. Will this effectively remove any Unicode space in the beginning/end of a string?

trim() removes " ", \t, \r, \n, \0 and \x0B, and \s matches " ", \t, \r, \n, \v and \f, so I don't think it's what you want. To remove other specific characters from a string you can always use trim($str, $charlist) with the second parameter. Can you give some examples of characters that you want to remove? — Naki, Apr 08 '12 at 22:11
What characters do you want to remove that trim() does not remove? — Niko, Apr 08 '12 at 22:15
I think your regex matches one or more spaces at either the start or end of a line — Robbie, Apr 08 '12 at 22:25
@knittl, yes, you are right! Didn't realize that. The function I'm declaring should have another name. I was just checking if in any time in the future an `mb_trim` function is added to the `mbstring` extension, and using that one instead of my own — federico-t, Apr 09 '12 at 03:42
The problem here is that NBSP is a multi-byte UTF-8 character, so `\s` only detects NBSP with the `/u` option. PHP is very confusing about "UTF-8 compatibility"... Is there a quick guide to what is and what is not "UTF-8 safe" today? Example: `str_replace` and `trim` are (in my view) UTF-8 compatible, so some functions do not need an `mb_*` counterpart while others do. And others, like `preg_*`, need options to detect UTF-8, even implicitly (see this implicit NBSP detection by `\s`). — Peter Krauss, Sep 08 '14 at 13:56

score 65 · Accepted Answer · edited Jan 27 '15 at 10:47

The standard trim function trims a handful of space and space-like characters. These are defined as ASCII characters, which means certain specific single bytes of the form 0xxx xxxx (0x00 through 0x20 for the default set).

Proper UTF-8 input will never contain multi-byte characters that are made up of bytes 0xxx xxxx. All the bytes in proper UTF-8 multibyte characters have the form 1xxx xxxx.

This means that in a proper UTF-8 sequence, the bytes 0xxx xxxx can only refer to single-byte characters. PHP's trim function will therefore never trim away "half a character" assuming you have a proper UTF-8 sequence. (Be very very careful about improper UTF-8 sequences.)
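
To make the byte layout concrete, here is a minimal sketch that dumps each byte of a string in binary. The helper name utf8_bits is made up for illustration, and the "\u{...}" escape assumes PHP 7.0 or later.

```php
<?php
// Hypothetical helper: return every byte of $s as an 8-digit binary string.
function utf8_bits(string $s): array
{
    $bits = [];
    foreach (str_split($s) as $byte) {
        $bits[] = str_pad(decbin(ord($byte)), 8, '0', STR_PAD_LEFT);
    }
    return $bits;
}

// An ASCII space is a single byte with the high bit clear.
print_r(utf8_bits(' '));          // 00100000

// U+30DB (the katakana "ho" used in the test below) is three bytes,
// and every one of them has the high bit set.
print_r(utf8_bits("\u{30DB}"));   // 11100011, 10000011, 10011011
```

Because none of the bytes trim() looks for by default can appear inside a multibyte character, the default trim() cannot cut one in half.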

In ASCII regular expressions, \s will mostly match the same characters as trim.

The preg functions with the /u modifier only work on UTF-8 encoded patterns and subject strings, and /\s/u also matches UTF-8's NBSP. This behaviour with non-breaking spaces is the only advantage to using it.

If you want to replace space characters in other, non ASCII-compatible encodings, neither method will work.

In other words, if you're trying to trim the usual spaces from an ASCII-compatible string, just use trim. When using /\s/u, be careful with the meaning of NBSP for your text.

Take care:

  $s1 = html_entity_decode(" Hello &#160; "); // the NBSP
  $s2 = "  exotic test ホ  ";

  echo "\nCORRECT trim: [". trim($s1) ."], [".  trim($s2) ."]";
  echo "\nSAME: [". trim($s1) ."] == [". preg_replace('/^\s+|\s+$/','',$s1) ."]";
  echo "\nBUT: [". trim($s1) ."] != [". preg_replace('/^\s+|\s+$/u','',$s1) ."]";

  echo "\n!INCORRECT trim: [". trim($s2,' ') ."]"; // DANGER! not UTF8 safe!
  echo "\nSAFE ONLY WITH preg: [". 
       preg_replace('/^[\s]+|[\s]+$/u', '', $s2) ."]";
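
Why a multibyte character list is dangerous with trim() can be sketched as follows. The sample strings are my own assumption, chosen because "à" (C3 A0) and "ã" (C3 A3) share their first byte in UTF-8; trim() treats the character list as a set of single bytes.

```php
<?php
$s   = "\u{E0} la carte";     // "à la carte", starts with bytes C3 A0
$out = trim($s, "\u{E3}");    // ask trim() to strip "ã", i.e. bytes C3 and A3

// trim() removes the leading C3 byte and leaves a stray A0 behind.
echo bin2hex(substr($out, 0, 1)), "\n";   // a0

// With the u modifier, preg_match() returns false for an invalid UTF-8 subject.
var_dump(preg_match('//u', $out));        // bool(false)
```

The result is no longer valid UTF-8, even though nothing in the input looked like the character that was supposed to be trimmed.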

edited Jan 27 '15 at 10:47

Pacerier

answered Apr 09 '12 at 00:23

deceze

`trim($s,'')` and `trim($s,' ')` work fine (!). The second example has an ASCII char working together... So we can say *"the `trim()` function is UTF-8 safe"* but not "`trim()` is ASCII, so it is UTF-8". People confuse `/\s/` and `/\s/u`, where only the latter detects NBSP. – Peter Krauss Sep 08 '14 at 13:50
Wrong! `trim($s,'')` may seem to work, but it can break the string into an invalid UTF-8 sequence. Don't use it! – Wes Nov 18 '14 at 02:21
Indeed, trimming ASCII characters off of a UTF-8 string is safe, but trimming UTF-8 characters off of a string is not. That's because `trim` doesn't understand "" to be one character, but three bytes, and it will trim any of those three bytes off *individually* when encountered. @Peter – deceze Nov 18 '14 at 07:20
Sorry, it was wrong to say "works fine" without a complete test; you are correct to say "`trim($s,$utf8)` is wrong!". I suggest saying this in the answer's text. About my other comment, I think the answer's text "`\s` will mostly match the same characters" *is wrong*: please test for yourself `preg_replace('/\s/u', '',$s)` when `$s = html_entity_decode(" Hello ");` contains the UTF-8 [NBSP](https://en.wikipedia.org/wiki/Non-breaking_space). – Peter Krauss Nov 19 '14 at 07:15
Sticking to non-utf8-aware trim() is a solution only as long as all the characters you want to strip away are one-byte characters. But if you want, for example, to also strip away some multibyte characters (e.g. U+200B, the "zero width space") you need a proper multibyte extension of trim which is what the OP asks for. – matteo Feb 16 '18 at 18:17
@matteo However, with iteration and multiple phases and layers of filtering and validation, the zero width space is a non-issue, as the string should not validate. – Anthony Rutledge Aug 14 '18 at 01:42

score 26 · Answer 2 · edited Jul 17 '20 at 22:00

I don't know what you're trying to do with that endless recursive function you're defining, but if you just want a multibyte-safe trim, this will work.

function mb_trim($str) {
  return preg_replace("/^\s+|\s+$/u", "", $str); 
}

edited Jul 17 '20 at 22:00

mickmackusa

answered Apr 08 '12 at 22:58

kba

Are pregs in PHP aware of various encodings? I can't remember, but I know there was a problem with them once upon a time somewhere, and I think it was here. – Incognito Apr 08 '12 at 23:03
`trim($s,'')` and `trim($s,' ')` work fine (!). Why do we need `mb_trim()`? – Peter Krauss Sep 08 '14 at 13:46
It would be better to use non-capturing subpatterns. http://us1.php.net/manual/en/regexp.reference.subpatterns.php . They have the form `(?: )` – Anthony Rutledge Sep 07 '18 at 13:23

score 7 · Answer 3 · edited Feb 20 '23 at 18:05

This version supports the second optional parameter $charlist:

function mb_trim($string, $charlist = null)
{
    if (is_null($charlist)) {
        return trim($string);
    }

    // Escape regex metacharacters, then the "/" delimiter.
    $charlist = str_replace('/', '\/', preg_quote($charlist));
    return preg_replace("/(^[$charlist]+)|([$charlist]+$)/u", '', $string);
}

Does not support ".." for ranges though.

edited Feb 20 '23 at 18:05

Casimir et Hippolyte

answered Nov 08 '12 at 12:11

Edson Medina

I like your way but don't forget to preg_quote your $charlist :) – Alain Tiemblo Sep 04 '13 at 11:51
This fails for `mb_trim('000foo000', '0')`... :-3 – deceze Dec 04 '13 at 15:54
This should be slightly changed. Your $charlist = preg_quote line needs to come inside the else, otherwise the is_null($charlist) check never works. – Michael Taggart May 08 '15 at 18:47
I don't know if this was an old PHP version thing, but `$charlist` can be escaped using this simpler call: `$charlist=preg_quote($charlist,'/');` or you can force it to be taken literally without a function call by wrapping it in `\Q` and `\E`. See "Code 2" at my related post: https://codereview.stackexchange.com/a/178931/141885 p.s. Since you won't have any "any character" dots in `$charlist` you can omit the `s` flag from the end of the pattern. – mickmackusa Nov 16 '17 at 07:18
Might consider using `if($charlist === null)` instead of `is_null`, according to [this question and answer](https://stackoverflow.com/questions/8228837/is-nullx-vs-x-null-in-php), to bypass the overhead from `is_null`. – Zlatan Omerović Dec 29 '17 at 12:58

score 6 · Answer 4 · edited Feb 20 '23 at 18:07

Ok, so I took @edson-medina's solution and fixed a bug and added some unit tests. Here's the 3 functions we use to give mb counterparts to trim, rtrim, and ltrim.

////////////////////////////////////////////////////////////////////////////////////
//Add some multibyte core functions not in PHP
////////////////////////////////////////////////////////////////////////////////////
function mb_trim($string, $charlist = null) {
    if (is_null($charlist)) {
        return trim($string);
    } else {
        $charlist = preg_quote($charlist, '/');
        return preg_replace("/(^[$charlist]+)|([$charlist]+$)/u", '', $string);
    }
}
function mb_rtrim($string, $charlist = null) {
    if (is_null($charlist)) {
        return rtrim($string);
    } else {
        $charlist = preg_quote($charlist, '/');
        return preg_replace("/([$charlist]+$)/u", '', $string);
    }
}
function mb_ltrim($string, $charlist = null) {
    if (is_null($charlist)) {
        return ltrim($string);
    } else {
        $charlist = preg_quote($charlist, '/');
        return preg_replace("/(^[$charlist]+)/u", '', $string);
    }
}
////////////////////////////////////////////////////////////////////////////////////

Here's the unit tests I wrote for anyone interested:

public function test_trim() {
    $this->assertEquals(trim(' foo '), mb_trim(' foo '));
    $this->assertEquals(trim(' foo ', ' o'), mb_trim(' foo ', ' o'));
    $this->assertEquals('foo', mb_trim(' Åfooホ ', ' Åホ'));
}

public function test_rtrim() {
    $this->assertEquals(rtrim(' foo '), mb_rtrim(' foo '));
    $this->assertEquals(rtrim(' foo ', ' o'), mb_rtrim(' foo ', ' o'));
    $this->assertEquals('foo', mb_rtrim('fooホ ', ' ホ'));
}

public function test_ltrim() {
    $this->assertEquals(ltrim(' foo '), mb_ltrim(' foo '));
    $this->assertEquals(ltrim(' foo ', ' o'), mb_ltrim(' foo ', ' o'));
    $this->assertEquals('foo', mb_ltrim(' Åfoo', ' Å'));
}

Opty · Answer 5 · 2012-09-14T07:03:16.107

5

You can also trim non-ASCII-compatible spaces (a non-breaking space, for example) on UTF-8 strings with preg_replace('/^\p{Z}+|\p{Z}+$/u','',$str);

\s will only match "ASCII-compatible" space characters, even with the u modifier, but \p{Z} will match all known Unicode space characters.

edited Sep 14 '12 at 07:03

answered Sep 14 '12 at 06:51

Opty

I edited @deceze's answer; regarding `/\s/u`, it is wrong to say it "will only match ASCII" (because NBSP is not ASCII). Can you correct it in your answer? About `\p{Z}`, sorry I did not cite it in my edit there; it is good to remember it (!). – Peter Krauss Nov 19 '14 at 23:15
As of PHP 7.2+ (possibly earlier), `\s` will match any Unicode space character (see my recent answer) with `u` on. `\p{Z}` on its own, however, will not match most ASCII whitespace such as tabs and line feeds. I don't know if this was different in 2014, but as of 2020, this is not accurate. – Markus AO Jul 20 '20 at 19:28

score 2 · Answer 6 · answered May 24 '12 at 14:11

mb_ereg_replace seems to get around that:

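function mb_trim($str, $regex = "(^\s+)|(\s+$)") {
    // mb_ereg patterns take no delimiters or trailing modifiers,
    // so the "/us" suffix would be matched literally and is dropped here.
    return mb_ereg_replace($regex, "", $str);
}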

..but I don't know enough about regular expressions to know how you'd then add on the "charlist" parameter people would expect to be able to feed to trim() - i.e. a list of characters to trim - so have just made the regex a parameter.

It might be that you could have an array of special characters, then step through it for each character in the charlist and escape them accordingly when building the regex string.

score 2 · Answer 7 · edited Feb 20 '23 at 18:01

(Ported from a duplicate Q on trim struggles with NBSP.) The following notes are valid as of PHP 7.2+. Mileage may vary with earlier versions (please report in comments).

PHP trim ignores non-breaking spaces. It only trims spaces found in the basic ASCII range. For reference, the source code for trim reads as follows (ie. no undocumented features with trim):

(c == ' ' || c == '\n' || c == '\r' || c == '\t' || c == '\v' || c == '\0')

Of the above, aside from the ordinary space (ASCII 32), these are all ASCII control characters: LF (10: \n), CR (13: \r), HT (9: \t), VT (11: \v), NUL (0: \0). (Note that in PHP, you have to double-quote escaped characters: "\n", "\t" etc. Otherwise they are parsed as literal \n etc.)
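
The quoting difference is easy to check. A small sketch, with sample strings of my own choosing:

```php
<?php
var_dump(strlen("\n"));   // int(1): a real line feed
var_dump(strlen('\n'));   // int(2): a backslash followed by the letter n

// The single-quoted list names "\" and "n", so the trailing line feed survives.
var_dump(rtrim("foo\n", '\n') === "foo\n");   // bool(true)

// The double-quoted list contains the actual line feed, so it is removed.
var_dump(rtrim("foo\n", "\n") === "foo");     // bool(true)
```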

The following are simple implementations of the three flavors of trim (ltrim, rtrim, trim), using preg_replace, that work with Unicode strings:

preg_replace('~^\s+~u', '', $string) // == ltrim
preg_replace('~\s+$~u', '', $string) // == rtrim
preg_replace('~^\s+|\s+$~u', '', $string) // == trim

Feel free to wrap them into your own mb_*trim functions.
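
One possible way to wrap them, as a sketch rather than a definitive implementation. The utf8_* names are hypothetical; I avoid the mb_ prefix because newer PHP releases (8.4, as far as I know) ship their own mb_trim(), mb_ltrim() and mb_rtrim(), and redeclaring those would be a fatal error. The fallback to the original string on failure is also just one design choice.

```php
<?php
// Hypothetical wrappers around the three patterns above.

function utf8_ltrim(string $string): string
{
    // preg_replace() returns null on failure (for example on invalid UTF-8);
    // this sketch then hands back the untouched input.
    return preg_replace('~^\s+~u', '', $string) ?? $string;
}

function utf8_rtrim(string $string): string
{
    return preg_replace('~\s+$~u', '', $string) ?? $string;
}

function utf8_trim(string $string): string
{
    return preg_replace('~^\s+|\s+$~u', '', $string) ?? $string;
}

// NBSP (U+00A0) and ideographic space (U+3000) around "foo";
// expected to be true on the PHP versions these notes cover.
var_dump(utf8_trim("\u{00A0} foo \u{3000}") === 'foo');
```

If silently returning broken input is not acceptable in your application, throw an exception instead of using the `??` fallback.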

Per the PCRE specification, the \s "any space" escape sequence character with u Unicode mode on will match all of the following space characters:

The horizontal space characters are:

U+0009     Horizontal tab (HT)
U+0020     Space
U+00A0     Non-break space
U+1680     Ogham space mark
U+180E     Mongolian vowel separator
U+2000     En quad
U+2001     Em quad
U+2002     En space
U+2003     Em space
U+2004     Three-per-em space
U+2005     Four-per-em space
U+2006     Six-per-em space
U+2007     Figure space
U+2008     Punctuation space
U+2009     Thin space
U+200A     Hair space
U+202F     Narrow no-break space
U+205F     Medium mathematical space
U+3000     Ideographic space

The vertical space characters are:

U+000A     Linefeed (LF)
U+000B     Vertical tab (VT)
U+000C     Form feed (FF)
U+000D     Carriage return (CR)
U+0085     Next line (NEL)
U+2028     Line separator
U+2029     Paragraph separator

You can see a test iteration of preg_replace with the u Unicode flag tackling all of the listed spaces. They are all trimmed as expected, following the PCRE spec. If you targeted only the horizontal spaces above, \h would match them, as \v would with all the vertical spaces.

The use of \p{Z} seen in some answers will fail on some counts; specifically, with most of the ASCII spaces, and shockingly, also with the Mongolian vowel separator. Kublai Khan would be furious. Here's the list of misses with \p{Z}: U+0009 Horizontal tab (HT), U+000A Linefeed (LF), U+000C Form feed (FF), U+000D Carriage return (CR), U+0085 Next line (NEL), and U+180E Mongolian vowel separator.

As to why that happens, the above PCRE specification also notes: "\s any character that matches \p{Z} or \h or \v". That is, \s is a superset of \p{Z}. Then, simply use \s in place of \p{Z}. It's more comprehensive and the import is more immediately obvious for someone reading your code, who may not remember the shorties for all character types.
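
A short sketch of the difference, using a sample string I made up (tab and NBSP on both sides of "foo"). It assumes PHP 7.2+ as stated at the top of this answer.

```php
<?php
// tab + NBSP + "foo" + NBSP + tab
$s = "\t\u{00A0}foo\u{00A0}\t";

$withZ = preg_replace('~^\p{Z}+|\p{Z}+$~u', '', $s);
$withS = preg_replace('~^\s+|\s+$~u', '', $s);

// The tab is not in \p{Z}, so it blocks the match at both ends
// and the string comes back unchanged.
var_dump($withZ === $s);

// \s covers both the tab and the NBSP, leaving only "foo".
var_dump($withS === 'foo');
```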

Anthony Rutledge · Answer 8 · 2018-09-07T14:34:21.527

My two cents

The actual solution to your question is that you should first do encoding checks before working to alter foreign input strings. Many are quick to learn about "sanitizing and validating" input data, but slow to learn the step of identifying the underlying nature (character encoding) of the strings they are working with early on.

How many bytes will be used to represent each character? With properly formatted UTF-8, it can be 1 (the characters trim deals with), 2, 3, or 4 bytes. The problem comes in when legacy, or malformed, representations of UTF-8 come into play--the byte character boundaries might not line up as expected (layman speak).

In PHP, some advocate that all strings should be forced to conform to proper UTF-8 encoding (1, 2, 3, or 4 bytes per character), where functions like trim() will still work because the byte/character boundary for the characters it deals with will be congruent for the Extended ASCII / 1-byte values that trim() seeks to eliminate from the start and end of a string (trim manual page).

However, because computer programming is a diverse field, one cannot possibly have a blanket approach that works in all scenarios. With that said, write your application the way it needs to be to function properly. Just doing a basic database driven website with form inputs? Yes, for my money force everything to be UTF-8.

Note: You will still have internationalization issues, even if your UTF-8 issue is stable. Why? Many non-English character sets exist in the 2, 3, or 4 byte space (code points, etc.). Obviously, if you use a computer that must deal with Chinese, Japanese, Russian, Arabic, or Hebrew scripts, you want everything to work with 2, 3, and 4 bytes as well! Remember, the PHP trim function can trim default characters, or user specified ones. This matters, especially if you need your trim to account for some Chinese characters.

I would much rather deal with the problem of someone not being able to access my site than the problem of access and responses that should not be occurring. When you think about it, this falls in line with the principles of least privilege (security) and universal design (accessibility).

Summary

If input data will not conform to proper UTF-8 encoding, you may want to throw an exception. You can attempt to use the PHP multi-byte functions to determine your encoding, or some other multi-byte library. If, and when, PHP is written to fully support Unicode (as Perl, Java ... do), PHP will be all the better for it. The PHP Unicode effort died a few years ago, hence you are forced to use extra libraries to deal with UTF-8 multi-byte strings sanely. Just adding the /u flag to preg_replace() is not looking at the big picture.
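
As a rough sketch of "check the encoding first, then trim": the function name trim_valid_utf8 is hypothetical, the code assumes the mbstring extension is loaded, and the choice of exception type is mine.

```php
<?php
// Hypothetical helper: refuse input that is not valid UTF-8, then trim it.
function trim_valid_utf8(string $input): string
{
    if (!mb_check_encoding($input, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }

    // Safe to use the u modifier now that the input is known to be valid.
    return preg_replace('/^\s+|\s+$/u', '', $input);
}

try {
    // "\xC3" followed by a space is a truncated two-byte sequence.
    trim_valid_utf8("\xC3 foo");
} catch (InvalidArgumentException $e) {
    echo $e->getMessage(), "\n";   // Input is not valid UTF-8.
}
```

Rejecting the input outright, rather than trying to repair it, matches the preference stated above for failing closed.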

Update:

That being said, I believe the following multibyte trim would be useful for those trying to extract REST resources from the path component of a URL (less the query string, naturally). Note: this would be useful after sanitizing and validating the path string.

function mb_path_trim($path)
{
    return preg_replace("/^(?:\/)|(?:\/)$/u", "", $path);
}
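
A quick usage sketch of that function, with made-up sample paths. Note that the pattern removes at most one slash from each end.

```php
<?php
function mb_path_trim($path)
{
    return preg_replace("/^(?:\/)|(?:\/)$/u", "", $path);
}

var_dump(mb_path_trim('/users/42/'));   // string(8) "users/42"
var_dump(mb_path_trim('//users//'));    // string(7) "/users/"
```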
