The string functions listed in the PHP documentation work at the byte level. That works for SBCS strings, but not for MBCS strings. Luckily, one well-known encoding, UTF-8, is backward compatible with 7-bit US-ASCII.
Since PHP 5.6 the default encoding has been UTF-8, but its string functions have not changed. The well-known alternatives are iconv, Multibyte String and Intl. The PCRE functions can also be MBCS compliant when PCRE is compiled with UTF-8 support.
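To illustrate the PCRE remark, here is a minimal sketch that assumes PCRE was built with UTF-8 support. The sample string is just my own choice; it compares the same pattern with and without the u modifier.

```php
<?php
// "é" written as explicit UTF-8 bytes so the example does not depend on the file encoding.
$e = "\xC3\xA9";

// Without the u modifier, "." matches one byte at a time.
$byteMatches = preg_match_all('/./', $e);

// With the u modifier, pattern and subject are treated as UTF-8, so "." matches one code point.
$charMatches = preg_match_all('/./u', $e);

var_dump($byteMatches, $charMatches); // int(2), int(1)
```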
When aging SBCS code needs to be made MBCS (UTF-8) compliant, calls to the standard PHP byte string functions need to be rewritten to be MBCS safe. Although the most basic functions (like strpos()) have an mb_* variant (like mb_strpos()), most of PHP's string functions have no mb_ counterpart. For continued use they have to be rewritten.
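To show what such a rewrite might look like, here is a rough sketch, assuming the mbstring extension is loaded and PCRE has UTF-8 support. utf8_strrev() is a hypothetical helper name of my own, not an existing PHP function; it stands in for strrev(), which has no mb_ counterpart.

```php
<?php
// "naïve" with "ï" written as explicit UTF-8 bytes: 5 characters, 6 bytes.
$s = "na\xC3\xAFve";

$byteLength = strlen($s);             // counts bytes
$charLength = mb_strlen($s, 'UTF-8'); // counts code points

// Hypothetical helper: split into code points with the u modifier, then reverse.
function utf8_strrev($str)
{
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return false; // input was not valid UTF-8
    }
    return implode('', array_reverse($chars));
}

$reversed   = utf8_strrev($s);
$byteResult = strrev($s);             // reverses bytes and splits the 2-byte sequence

var_dump($byteLength, $charLength);                  // int(6), int(5)
var_dump(mb_check_encoding($reversed, 'UTF-8'));     // bool(true)
var_dump(mb_check_encoding($byteResult, 'UTF-8'));   // bool(false)
```

Note that this reverses code points, not grapheme clusters, so combining characters would still end up detached from their base character.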
As a first step, one needs to determine which SBCS string functions will work despite their byte-oriented nature. Some have already been identified on SO; what I'm looking for now is a comprehensive list of functions that will work with UTF-8, either as-is or when used with caution, for example with US-ASCII-only parameters. To clarify, the question is not about the byte string functions like chr() or crc32(); it's about getting a list of functions like:
- Not safe: count_chars() counts bytes, ...
- Caution: ltrim() will work as long as parameters are US-ASCII, ...
- Safe: str_repeat() will work with MBCS strings, ...
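To make the three categories concrete, here is a small sketch of how each example behaves on UTF-8 input. It assumes the mbstring extension is available for the checks, and the sample strings are my own.

```php
<?php
$e = "\xC3\xA9"; // "é" as UTF-8 bytes
$a = "\xC3\xA0"; // "à" as UTF-8 bytes

// Not safe: count_chars() reports two distinct byte values for the single character "é".
$byteCounts = count_chars($e, 1);

// Caution: with the default (US-ASCII) mask, ltrim() leaves the UTF-8 text intact.
$trimmedAscii = ltrim("  " . $e . "t" . $e);

// Caution: with a non-ASCII mask, the mask is treated as a set of bytes.
// Trimming "é" from "àé" strips the shared lead byte 0xC3 of "à".
$trimmedBroken = ltrim($a . $e, $e);

// Safe: str_repeat() only concatenates whole copies of the input.
$repeated = str_repeat($e, 3);

var_dump(count($byteCounts));                          // int(2)
var_dump(mb_check_encoding($trimmedAscii, 'UTF-8'));   // bool(true)
var_dump(mb_check_encoding($trimmedBroken, 'UTF-8'));  // bool(false)
var_dump(mb_strlen($repeated, 'UTF-8'));               // int(3)
```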
Would anybody know such a list?