Best practices for dealing with UTF-8 and Unicode in PHP 8

Question

Dealing with Unicode and UTF-8 has been a nightmare in PHP for years, but I've always hoped things would get better with PHP 8. Have they?

What considerations must a developer using PHP 8 make with regard to receiving, processing, storing, and returning content in UTF-8?

I know about "UTF-8 all the way through", but how much of that advice still applies in PHP 8? Are there new and better ways to handle Unicode in PHP? Are standard string functions now UTF-8 safe?

This is too broad and opinionated, and there are too many questions here for this to be a good Stack Overflow question. If there's a problem you're trying to solve, try to focus on that. If not, it might be better to try a different forum, like the PHP subreddit. — Evert, Jul 19 '21 at 02:16
It also depends on what you mean by "dealing". In most cases, normalizing and then treating strings as binary blobs is the way to go. Otherwise, most string functions will cause problems in specific cases for specific languages (too many languages with too many different rules). Fortunately, nobody needs to build one universal site for all languages (it is better to have different sites for specific language families, also for the sake of overall design and readability). — Giacomo Catenazzi, Jul 19 '21 at 06:24
The question I think people should be asking themselves more often is "what level of abstraction am I interested in right now?" For instance, "what is the length of this string?" has many different answers; `strlen`, `mb_strlen`, and `grapheme_strlen` will all give you different but **equally valid** answers. Similarly, `substr` is a common example, but how often do you _really_ want "the first 20 Unicode code points, even if the result makes no visual sense" rather than "the first 20 visual characters (graphemes)", or even "at most 20 bytes without dividing a grapheme". — IMSoP, Jul 19 '21 at 13:16

score 3 · Answer 1 · answered Jul 19 '21 at 07:32

A developer needs to know what UTF-8 really is before using it in a program.

Only then will you understand why things cannot get any better in any programming language without rewriting the language completely. So UTF-8 has nothing to do with PHP specifically.

UTF-8 has two advantages over other Unicode encoding schemes: it is backward compatible with US-ASCII, and it is self-synchronising. Because of that backward compatibility, many of PHP's single-byte (SBCS) string functions also work with multibyte (MBCS) strings.
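As a rough illustration of that point, the sketch below runs a few of PHP's byte-oriented functions on a UTF-8 string. The sample string and variable names are made up for the example. Searching and splitting work because a valid UTF-8 needle can only match at a character boundary, while anything that counts or cuts by byte offset can land in the middle of a character.

```php
<?php
// Sample text built with \u{...} escapes so the bytes are unambiguous:
// "naïve café, über" with precomposed ï, é and ü (2 bytes each in UTF-8).
$s = "na\u{EF}ve caf\u{E9}, \u{FC}ber";

// Byte-oriented search, replace and split are safe on valid UTF-8.
$parts    = explode(", ", $s);              // ["naïve café", "über"]
$replaced = str_replace("\u{E9}", "e", $s); // "naïve cafe, über"
$offset   = strpos($s, "caf\u{E9}");        // 7, a byte offset, not a character index

// strlen() counts bytes: 16 characters, but 19 bytes.
$bytes = strlen($s);

// substr() cuts by byte offset and can split a character in two.
$broken = substr($s, 0, 3);                 // "na" plus the first byte of "ï"

// preg_match() with the /u modifier fails on invalid UTF-8, which makes it
// a simple validity check that needs no extra extension.
$brokenIsValid = preg_match('//u', $broken) === 1; // false
$wholeIsValid  = preg_match('//u', $s) === 1;      // true
```

The byte offset returned by `strpos` is still fine to pass back into `substr`, since both count bytes; problems start only when byte offsets are treated as character positions.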

To answer the question: not much has changed with PHP 8. The default encoding (the `default_charset` setting) has been UTF-8 since PHP 5.6. The "UTF-8 all the way through" article still applies. You'll need an extension such as mbstring to work with multibyte encodings, and you'll need to take into account the encoding scheme of each data source, such as the database you're using. That last point is common to all programming languages.
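A minimal sketch of what that looks like in practice, assuming the mbstring extension is loaded. The helper name `validate_utf8` and the host and database name in the DSN are hypothetical placeholders, not part of PHP or of the original answer.

```php
<?php
// "naïve café, über", written with escapes so the bytes are unambiguous.
$s = "na\u{EF}ve caf\u{E9}, \u{FC}ber";

// Code-point-aware counterparts of strlen(), substr() and strtoupper().
$length = mb_strlen($s, 'UTF-8');       // 16 code points (strlen() reports 19 bytes)
$head   = mb_substr($s, 0, 3, 'UTF-8'); // "naï"
$upper  = mb_strtoupper($s, 'UTF-8');   // "NAÏVE CAFÉ, ÜBER"

// Hypothetical helper: accept input only if it is valid UTF-8.
function validate_utf8(string $input): ?string
{
    return mb_check_encoding($input, 'UTF-8') ? $input : null;
}

$ok  = validate_utf8($s);         // the string, unchanged
$bad = validate_utf8("caf\xE9");  // null: a lone 0xE9 byte is Latin-1, not UTF-8

// Each data source has its own encoding setting. For MySQL through PDO the
// charset goes in the DSN; host and dbname below are placeholders.
$dsn = 'mysql:host=localhost;dbname=example;charset=utf8mb4';
```

Passing the encoding explicitly to each `mb_*` call keeps the code independent of the server's `mbstring` configuration. Note that these functions work on code points; text containing combining marks or emoji sequences needs the grapheme functions from the intl extension instead.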
