The string functions listed in the PHP documentation operate at the byte level. That works for single-byte character set (SBCS) strings, but not for multi-byte character set (MBCS) strings. Luckily, one widely used encoding, UTF-8, is backward compatible with 7-bit US-ASCII.
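A minimal sketch of what that backward compatibility looks like at the byte level, assuming valid UTF-8 input (the sample string and variable names are only illustrations): ASCII characters keep their single-byte values, while every byte of a multi-byte character has its high bit set.

```php
<?php
// "aé€" written as explicit UTF-8 bytes, so the example does not depend
// on the encoding of this source file.
$mixed = "a\xC3\xA9\xE2\x82\xAC";

$asciiBytes = 0;
$highBytes  = 0;
foreach (str_split($mixed) as $byte) {
    // str_split() without a length argument yields single bytes.
    if (ord($byte) < 0x80) {
        $asciiBytes++;   // a complete US-ASCII character
    } else {
        $highBytes++;    // lead or continuation byte of a multi-byte character
    }
}

printf("%d ASCII byte(s), %d byte(s) in multi-byte sequences\n", $asciiBytes, $highBytes);
// prints "1 ASCII byte(s), 5 byte(s) in multi-byte sequences"
```

Because no byte below 0x80 occurs inside a multi-byte sequence, a byte-oriented function that only looks for ASCII bytes cannot land in the middle of a character.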

Since PHP 5.6 the default encoding is UTF-8, but its string functions have not changed. The well-known alternatives are iconv, Multibyte String and Intl. The PCRE functions can also be MBCS compliant when PCRE is compiled with UTF-8 support.
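As a rough illustration of how these alternatives differ from the byte functions, the sketch below counts the characters of one UTF-8 string in several ways. It assumes PCRE was built with UTF-8 support; mbstring, iconv and Intl are optional extensions, so each call is guarded. The `$counts` array is just a name chosen for this example.

```php
<?php
$s = "na\xC3\xAFve";   // "naïve": 5 characters, 6 bytes in UTF-8

$counts = array();
$counts['strlen'] = strlen($s);                  // bytes: 6

// The "u" modifier switches PCRE to UTF-8 mode; "s" lets "." match newlines too.
$counts['pcre'] = preg_match_all('/./us', $s);   // code points: 5

// The extensions below may not be installed, so check before calling.
if (function_exists('mb_strlen')) {
    $counts['mbstring'] = mb_strlen($s, 'UTF-8');
}
if (function_exists('iconv_strlen')) {
    $counts['iconv'] = iconv_strlen($s, 'UTF-8');
}
if (function_exists('grapheme_strlen')) {
    $counts['intl'] = grapheme_strlen($s);
}

print_r($counts);
```

Note that Intl's `grapheme_strlen()` counts grapheme clusters rather than code points, so it can differ from the other character counts for strings with combining marks.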

When old SBCS code needs to be made MBCS (UTF-8) compliant, calls to the standard PHP byte string functions need to be rewritten to be MBCS safe. Although the most basic functions (like strpos()) have an mb_* variant (like mb_strpos()), most of PHP's string functions have no mb_* counterpart. To keep using them, they have to be rewritten.
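To give an idea of what such a rewrite can look like, here is a sketch of a UTF-8 aware replacement for strrev(). The function name `utf8_strrev()` is hypothetical (it is not part of PHP or any extension), and the sketch assumes PCRE with UTF-8 support.

```php
<?php
/**
 * Hypothetical helper, not part of PHP: reverse a UTF-8 string by code point.
 * Returns false when the input is not valid UTF-8.
 */
function utf8_strrev($string)
{
    // With the "u" modifier, splitting on the empty pattern yields one
    // array element per code point instead of one per byte.
    $chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return false;   // preg_split() fails on malformed UTF-8
    }
    return implode('', array_reverse($chars));
}

$s = "h\xC3\xA9llo";            // "héllo"

$byteReversed = strrev($s);     // "oll\xA9\xC3h": the two bytes of "é" are swapped
$charReversed = utf8_strrev($s); // "oll\xC3\xA9h": "olléh"

var_dump(bin2hex($byteReversed), bin2hex($charReversed));
```

The sketch works per code point, so a base letter followed by a combining mark would still be reordered; handling that would need grapheme-aware splitting, for example with Intl.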

In the first stage, one needs to determine which SBCS string functions will work despite their byte-oriented nature. Some have already been identified on SO. What I'm looking for now is a comprehensive list of functions that work with UTF-8, either unconditionally or when used with caution, for example with US-ASCII-only parameters. To clarify, the question is not about the byte string functions like chr() or crc32(); it's about getting a list of functions like:

  • Not safe: count_chars() counts bytes, ...
  • Caution: ltrim() will work as long as parameters are US-ASCII, ...
  • Safe: str_repeat() will work with MBCS strings, ...
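The three categories can be illustrated with a short sketch. The sample string is arbitrary and written as explicit UTF-8 bytes; the variable names are only for this example.

```php
<?php
$s = "\xC3\xA9t\xC3\xA9";   // "été": 3 characters, 5 bytes

// Not safe: count_chars() counts bytes, so "é" shows up as the two
// separate byte values 0xC3 (195) and 0xA9 (169).
$byteCounts = count_chars($s, 1);       // array(116 => 1, 169 => 2, 195 => 2)

// Caution: ltrim() is fine as long as the character list is US-ASCII only.
$trimmed = ltrim("  \t" . $s, " \t");   // "été"

// Safe: str_repeat() only concatenates whole copies of the byte sequence.
$repeated = str_repeat($s, 2);          // "étéété"

var_dump($byteCounts, $trimmed === $s, $repeated === $s . $s);
```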

Would anybody know such a list?

Code4R7
  • Have you searched [the list of PHP string functions](http://php.net/manual/en/ref.strings.php) in the documentation? – axiac Jun 13 '17 at 12:40
  • Yes, I just need a list to know which functions to rewrite. – Code4R7 Jun 13 '17 at 13:39
  • Providing a list of stuff is a bad fit for SO, but it's enough of an issue in PHP that it's a reasonable question. Unfortunately the most viable thing is to really understand UTF-8 and what each individual string function does, from which you can conclude yourself whether what you're doing is safe or not. Yeah, not a terribly satisfying answer, I know… – deceze Jun 13 '17 at 13:43
  • @deceze I've seen many questions whether function X is multibyte safe. While those questions do fit SO I agree with the bad fit. I've considered SO Documentation, but I've read they're going to change the structure. – Code4R7 Jun 13 '17 at 13:46

2 Answers

Core PHP SBCS string functions

Assuming the default encoding of PHP is set to UTF-8, these string functions will work:

Unfortunately, none of the other string functions work with UTF-8. The obstacles are:

  • case handling or whitespace handling does not work with UTF-8
  • string lengths in parameters and return values are byte lengths, not character lengths
  • string processing causes data corruption
  • the string function is completely ASCII oriented

In some cases functions can work as expected when parameters are US-ASCII and lengths are byte lengths.
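A small sketch of these obstacles, and of the case that still works. It assumes the default "C" locale (from PHP 8 on, strtoupper() is ASCII-only regardless of locale) and PCRE with UTF-8 support for the validity check; the variable names are illustrative.

```php
<?php
$s = "\xC3\xA9cole";   // "école": 5 characters, 6 bytes

// Case handling: only ASCII letters are converted, so "é" is left alone.
$upper = strtoupper($s);               // "éCOLE", not "ÉCOLE"

// Lengths and offsets are in bytes: "c" is the second character,
// but it sits at byte offset 2.
$offset = strpos($s, 'c');             // 2

// Corruption: cutting at an arbitrary byte offset can split a character.
$broken  = substr($s, 0, 1);           // "\xC3", an incomplete sequence
$isValid = preg_match('//u', $broken) === 1;   // false: not valid UTF-8

// Working case: an ASCII needle, with the byte offset used consistently.
$tail = substr($s, strpos($s, 'c'));   // "cole"

var_dump(bin2hex($upper), $offset, $isValid, $tail);
```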

Binary string functions are still useful:

  • bin2hex Convert binary data into hexadecimal representation
  • chr Return a specific character (=byte)
  • convert_uudecode Decode a uuencoded string
  • convert_uuencode Uuencode a string
  • crc32 Calculates the crc32 polynomial of a string
  • crypt One-way string hashing
  • hex2bin Decodes a hexadecimally encoded binary string
  • md5_file Calculates the md5 hash of a given file
  • md5 Calculate the md5 hash of a string
  • ord Return ASCII value of character (=byte)
  • sha1_file Calculate the sha1 hash of a file
  • sha1 Calculate the sha1 hash of a string
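These functions do exactly what they say on bytes, which is what makes them predictable with UTF-8 data. A minimal sketch, using the euro sign as an arbitrary example (hex2bin() needs PHP 5.4 or later):

```php
<?php
$euro = "\xE2\x82\xAC";   // "€" (U+20AC) as three UTF-8 bytes

$hex       = bin2hex($euro);   // "e282ac"
$roundTrip = hex2bin($hex);    // the same three bytes again

// ord() looks at the first byte only; it does not return the code point 0x20AC.
$firstByte = ord($euro);       // 226 (0xE2)

// chr() produces single bytes, so a character is rebuilt byte by byte.
$rebuilt = chr(0xE2) . chr(0x82) . chr(0xAC);

// Hashes are computed over bytes: "é" in ISO-8859-1 (one byte) and in
// UTF-8 (two bytes) therefore give different digests.
$sameHash = md5("\xE9") === md5("\xC3\xA9");   // false

var_dump($hex, $roundTrip === $euro, $firstByte, $rebuilt === $euro, $sameHash);
```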

Configuration functions do not apply:

Regular expression functions and encoding and transcoding functions are not considered.

Extensions

In quite a few cases, Multibyte String offers a UTF-8 variant:

  • mb_convert_case Perform case folding on a string
  • mb_parse_str Parse GET/POST/COOKIE data and set global variable
  • mb_split Split multibyte string using regular expression
  • mb_strcut Get part of string
  • mb_strimwidth Get truncated string with specified width
  • mb_stripos Finds position of first occurrence of a string within another, case insensitive
  • mb_stristr Finds first occurrence of a string within another, case insensitive
  • mb_strlen Get string length
  • mb_strpos Find position of first occurrence of string in a string
  • mb_strrchr Finds the last occurrence of a character in a string within another
  • mb_strrichr Finds the last occurrence of a character in a string within another, case insensitive
  • mb_strripos Finds position of last occurrence of a string within another, case insensitive
  • mb_strrpos Find position of last occurrence of a string in a string
  • mb_strstr Finds first occurrence of a string within another
  • mb_strtolower Make a string lowercase
  • mb_strtoupper Make a string uppercase
  • mb_strwidth Return width of string
  • mb_substr_count Count the number of substring occurrences
  • mb_substr Get part of string
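A brief sketch of a few of these functions next to their byte-oriented counterparts. It assumes the mbstring extension is loaded; the encoding is passed explicitly so that the result does not depend on the configured internal encoding.

```php
<?php
// Requires the mbstring extension.
$s = "\xC3\xA9cole";                        // "école"

$length = mb_strlen($s, 'UTF-8');           // 5 characters (strlen() gives 6)
$pos    = mb_strpos($s, 'c', 0, 'UTF-8');   // 1 (strpos() gives 2)
$first  = mb_substr($s, 0, 1, 'UTF-8');     // "é", both bytes intact
$upper  = mb_strtoupper($s, 'UTF-8');       // "ÉCOLE"

// mb_strwidth() counts display columns: East Asian wide characters count as 2.
$cjk   = "\xE6\x97\xA5\xE6\x9C\xAC";        // "日本": 2 characters, 6 bytes
$width = mb_strwidth($cjk, 'UTF-8');        // 4

var_dump($length, $pos, bin2hex($first), bin2hex($upper), $width);
```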

And iconv provides a bare minimum of string functions:

Lastly Intl has a lot of extra and powerful Unicode features (but no regular expressions) as part of i18n. Some features overlap with other string functions. With respect to string functions these are:

Code4R7

The PHP standard string functions are not able to handle multi-byte strings correctly. They handle their arguments as single-byte character strings, no matter what kind of strings you pass to them. They don't operate on characters but on bytes.

PHP doesn't keep the encoding of each string. It handles all of them the same way.

The PHP multi-byte string functions provided by the mbstring PHP extension can handle multiple character encodings, convert between encodings and auto-detect the encoding of a given string. They operate on characters and are able to handle both fixed-length encodings (UTF-32, for example) and variable-length encodings (UTF-8).
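Since PHP does not store an encoding with a string, the caller has to name it. The sketch below, which assumes the mbstring extension is loaded, converts one character between encodings and compares byte and character counts; the variable names are illustrative.

```php
<?php
// Requires the mbstring extension.
$utf8  = "\xC3\xA9";   // "é" in UTF-8
$utf16 = mb_convert_encoding($utf8, 'UTF-16BE', 'UTF-8');   // "\x00\xE9"
$utf32 = mb_convert_encoding($utf8, 'UTF-32BE', 'UTF-8');   // "\x00\x00\x00\xE9"

// Byte lengths differ per encoding...
$bytes = array(strlen($utf8), strlen($utf16), strlen($utf32));   // 2, 2, 4

// ...but the character count is 1 each time, provided the right
// encoding name is passed along with the string.
$chars = array(
    mb_strlen($utf8, 'UTF-8'),
    mb_strlen($utf16, 'UTF-16BE'),
    mb_strlen($utf32, 'UTF-32BE'),
);

// Detection with a candidate list in strict mode: the bytes are not
// valid ASCII, so UTF-8 is reported.
$detected = mb_detect_encoding($utf8, array('ASCII', 'UTF-8'), true);

var_dump($bytes, $chars, $detected);
```

Detection is a heuristic over the candidate list, so it is usually safer to know the encoding of the input than to guess it.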

axiac
  • +1 for mostly true: PHP does operate at the byte level, which means that those functions can mess up your MBCS strings. But since UTF-8 is US-ASCII compatible, some functions will work despite their 8-bit byte design. – Code4R7 Jun 13 '17 at 13:00
  • All PHP string functions work well with UTF-8 encoded strings as long as the strings use only 7-bit ASCII characters (because the encoding of the first 128 characters is identical in ASCII and UTF-8). – axiac Jun 13 '17 at 13:26
  • There are many standard string functions which work just fine on UTF-8 strings, e.g. `str_replace`. Due to the nature of what they're doing it doesn't matter whether they operate on a byte or character basis. Others, on the other hand, will fail, e.g. `trim` with a custom non-ASCII character mask (but with an ASCII-only mask it works just fine as well). – deceze Jun 13 '17 at 13:37
  • @axiac Not true, check [`htmlentities`](http://php.net/manual/en/function.htmlentities.php) which [does work with UTF-8](https://stackoverflow.com/questions/5679715/htmlentities-destroys-utf-8-strings#answer-5679795). – Code4R7 Jun 13 '17 at 13:56
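The cases mentioned in these comments can be reproduced with a short sketch. The sample strings are arbitrary and written as explicit UTF-8 bytes; the encoding is passed to htmlentities() explicitly rather than relying on the configured default.

```php
<?php
// str_replace() compares whole byte sequences. A valid UTF-8 needle can
// only match at a character boundary, so the result stays valid.
$replaced = str_replace("\xC3\xA9", 'e', "caf\xC3\xA9");   // "café" -> "cafe"

// trim() treats the mask as a set of single bytes. The mask "é" is the
// bytes 0xC3 and 0xA9, so the lead byte of "à" (0xC3 0xA0) is stripped too.
$damaged = trim("\xC3\xA0 la carte", "\xC3\xA9");   // "\xA0 la carte", invalid UTF-8

// With an ASCII-only mask the same function is harmless.
$intact = trim("  \xC3\xA0 la carte  ", ' ');       // "à la carte"

// htmlentities() is encoding aware when told which encoding to use.
$entities = htmlentities("caf\xC3\xA9", ENT_QUOTES, 'UTF-8');   // "caf&eacute;"

var_dump($replaced, bin2hex($damaged), $intact, $entities);
```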