My two cents
The real answer to your question is that you should check the encoding of foreign input strings before you try to alter them. Many people are quick to learn about "sanitizing and validating" input data, but slow to learn the earlier step of identifying the underlying nature (the character encoding) of the strings they are working with.
How many bytes will be used to represent each character? With properly formed UTF-8, it can be 1 (the characters trim() deals with), 2, 3, or 4 bytes. The problem arises when legacy or malformed representations of UTF-8 come into play: the byte and character boundaries might not line up as expected (in layman's terms).
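To make the byte counts concrete, here is a small sketch that compares strlen() (bytes) with mb_strlen() (characters). It assumes the mbstring extension is loaded and PHP 7.0 or later for the "\u{...}" escapes; the sample characters and array keys are my own picks for illustration.

```php
<?php
// Sample characters chosen for illustration, one per UTF-8 byte length.
$samples = [
    'latin'    => "a",          // U+0061
    'accented' => "\u{E9}",     // U+00E9
    'cjk'      => "\u{4E2D}",   // U+4E2D
    'emoji'    => "\u{1F600}",  // U+1F600
];

foreach ($samples as $label => $char) {
    // strlen() counts bytes; mb_strlen() counts characters in the given encoding.
    printf("%s: %d byte(s), %d character(s)\n", $label, strlen($char), mb_strlen($char, 'UTF-8'));
}
// latin: 1 byte(s), 1 character(s)
// accented: 2 byte(s), 1 character(s)
// cjk: 3 byte(s), 1 character(s)
// emoji: 4 byte(s), 1 character(s)
```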
In PHP, some advocate forcing all strings to conform to proper UTF-8 encoding (1, 2, 3, or 4 bytes per character). Functions like trim() will then still work, because the characters trim() strips by default from the start and end of a string are all single-byte ASCII values, and in valid UTF-8 those byte values never occur inside a multi-byte character (see the trim manual page).
However, because computer programming is a diverse field, one cannot possibly have a blanket approach that works in all scenarios. With that said, write your application the way it needs to be written to function properly. Just doing a basic database-driven website with form inputs? Yes, for my money, force everything to be UTF-8.
Note: You will still have internationalization issues, even once your UTF-8 handling is stable. Why? Many non-English scripts live in the 2-, 3-, or 4-byte space (code points, etc.). Obviously, if your application must deal with Chinese, Japanese, Russian, Arabic, or Hebrew text, you want everything to work with 2, 3, and 4 bytes as well! Remember, the PHP trim() function can trim its default characters or user-specified ones. This matters, especially if you need your trim to account for some Chinese characters: trim() treats its character list as individual bytes, not as UTF-8 characters.
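As a rough sketch of trimming user-specified multi-byte characters, something like the following could work. The function name utf8_trim_chars() is hypothetical (it is not a PHP built-in), and it assumes both arguments are supposed to be UTF-8. The last lines show how plain trim() with a multi-byte character list can leave a broken byte sequence behind.

```php
<?php
// Hypothetical helper (not part of PHP): trims any of the UTF-8 characters
// in $chars from both ends of $str.
function utf8_trim_chars(string $str, string $chars): string
{
    if ($chars === '') {
        return $str; // nothing to trim; also avoids building an empty character class
    }
    // preg_quote() escapes regex metacharacters, including "]", "-" and the "/" delimiter.
    $class = preg_quote($chars, '/');
    $result = preg_replace('/^[' . $class . ']+|[' . $class . ']+$/u', '', $str);
    if ($result === null) {
        // With the /u modifier, preg_replace() returns null on malformed UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }
    return $result;
}

// Ideographic space (U+3000) and ideographic full stop (U+3002) around two CJK characters.
$text = "\u{3000}\u{4F60}\u{597D}\u{3002}";
var_dump(utf8_trim_chars($text, "\u{3000}\u{3002}") === "\u{4F60}\u{597D}"); // bool(true)

// trim() works on bytes: U+00E0 is C3 A0 and U+00C3 is C3 83, so the shared
// C3 byte is stripped and a stray A0 byte is left behind.
$broken = trim("\u{E0}bc", "\u{C3}");
var_dump(mb_check_encoding($broken, 'UTF-8')); // bool(false)
```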
I would much rather deal with the problem of someone not being able to access my site than the problem of access and responses that should not be occurring. When you think about it, this falls in line with the principles of least privilege (security) and universal design (accessibility).
Summary
If input data does not conform to proper UTF-8 encoding, you may want to throw an exception. You can attempt to use the PHP multi-byte (mbstring) functions, or some other multi-byte library, to determine your encoding. If and when PHP is written to fully support Unicode (as Perl, Java, and others do), PHP will be all the better for it. The PHP Unicode effort died a few years ago, hence you are forced to use extra libraries to deal with UTF-8 multi-byte strings sanely. Just adding the /u flag to preg_replace() is not looking at the big picture.
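A minimal sketch of the "check first, throw if it is not UTF-8" idea, assuming the mbstring extension is available. The function name require_utf8() is made up for this example.

```php
<?php
// Hypothetical guard: returns the input unchanged if it is valid UTF-8,
// otherwise throws so the caller can reject the request early.
function require_utf8(string $input): string
{
    if (!mb_check_encoding($input, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }
    return $input;
}

try {
    $clean = trim(require_utf8("  caf\u{E9}  ")); // valid UTF-8, safe to trim
    require_utf8("\xC3\x28");                      // C3 must be followed by a continuation byte
} catch (InvalidArgumentException $e) {
    echo $e->getMessage(), "\n"; // Input is not valid UTF-8.
}
```

Only after this check passes would I move on to trimming, sanitizing, and validating.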
Update:
That being said, I believe the following multibyte trim would be useful for those trying to extract REST resources from the path component of a URL (less the query string, naturally). Note: this would be useful after sanitizing and validating the path string.
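function mb_path_trim($path)
{
    // Strip one leading and one trailing slash; /u treats $path as UTF-8.
    // preg_replace() returns null if $path is not valid UTF-8.
    return preg_replace('#^/|/$#u', '', $path);
}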
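For what it is worth, here is how I would expect it to be used. The URL is a made-up example, and the function is repeated only so the snippet runs on its own.

```php
<?php
// Repeated from above so this snippet is self-contained.
function mb_path_trim($path)
{
    return preg_replace('#^/|/$#u', '', $path);
}

// Example URL invented for illustration.
$url  = 'https://example.com/api/users/42/?sort=asc';
$path = parse_url($url, PHP_URL_PATH);         // "/api/users/42/" (query string excluded)
$segments = explode('/', mb_path_trim($path)); // ["api", "users", "42"]

print_r($segments);
```

Keep in mind that it removes only one slash from each end, so a path like "//api//" would still carry slashes afterwards.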