9

Remembering to do all the stuff you need to do in PHP to get it to work properly with Unicode is far too tricky, tedious, and error-prone, so I'm looking for the trick to get PHP to magically upgrade absolutely everything it possibly can from musty old ASCII byte mode into modern Unicode character mode, all at once and by using just one simple declaration.

The idea is to modernize PHP scripts to work with Unicode without having to clutter up the source code with a bunch of confusing alternate function calls and special regexes. Everything should just “Do The Right Thing” with Unicode, no questions asked.

Given that the goal is maximum Unicodeness with minimal fuss, this declaration must at least do these things (plus anything else I’ve forgotten that furthers the overall goal):

  • The PHP script source is itself considered to be in UTF‑8 (eg, strings and regexes).

  • All input and output is automatically converted to/from UTF‑8 as needed, and with a normalization option (eg, all input normalized to NFD and all output normalized to NFC).

  • All functions with Unicode versions use those instead (eg, Collator::sort for sort).

  • All byte functions (eg, strlen, strstr, strpos, and substr) work like the corresponding character functions (eg, mb_strlen, mb_strstr, mb_strpos, and mb_substr).

  • All regexes and regexy functions transparently work on Unicode (ie, like all the preggers have /u tacked on implicitly, and things like \w and \b and \s all work on Unicode the way The Unicode Standard requires them to work, etc).

For extra credit :), I'd like there to be a way to “upgrade” this declaration to full grapheme mode. That way the byte or character functions become grapheme functions (eg, grapheme_strlen, grapheme_strstr, grapheme_strpos, and grapheme_substr), and the regex stuff works on proper graphemes (ie, . — or even [^abc] — matches a Unicode grapheme cluster no matter how many code points it contains, etc).
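To make the wish list above concrete, here is a minimal sketch of what each item looks like today when spelled out by hand. It assumes PHP 7.0 or later (for the "\u{...}" escapes) with the mbstring and intl extensions loaded; the variable names and sample strings are purely illustrative.

```php
<?php
// Assumes PHP 7.0+ with the mbstring and intl extensions enabled.

// "é" written as "e" + U+0301 COMBINING ACUTE ACCENT (decomposed, NFD).
$decomposed = "e\u{301}";

// The same text measured in bytes, code points, and graphemes.
$byteCount      = strlen($decomposed);              // 3
$codePointCount = mb_strlen($decomposed, 'UTF-8');  // 2
$graphemeCount  = grapheme_strlen($decomposed);     // 1

// Normalizing to NFC composes the pair into the single code point U+00E9.
$composed           = Normalizer::normalize($decomposed, Normalizer::FORM_C);
$composedCodePoints = mb_strlen($composed, 'UTF-8'); // 1

// Regexes: "." matches a byte without /u, a code point with /u,
// and \X matches a whole grapheme cluster.
$word          = "h" . $decomposed . "llo";
$dotBytes      = preg_match_all('/./', $word);    // 7
$dotCodePoints = preg_match_all('/./u', $word);   // 6
$clusters      = preg_match_all('/\X/u', $word);  // 5

// Sorting: sort() compares bytes, Collator::sort() uses language-aware collation.
$words = ['zebra', "\u{e9}cole", 'apple'];

$byteSorted = $words;
sort($byteSorted);                      // apple, zebra, école

$collated = $words;
(new Collator('en'))->sort($collated);  // apple, école, zebra
```

Every one of these calls has to be chosen explicitly, which is exactly the clutter the single declaration is meant to eliminate.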

tchrist

2 Answers

6

That full-Unicode thing was precisely the idea of PHP 6 -- which was canceled more than a year ago.

So, no, there is no way of getting all that -- except by using the right functions, and remembering that characters are not the same as bytes.


One thing that might help with your fourth point, though, is the Function Overloading Feature of the mbstring extension (quoting):

mbstring supports a 'function overloading' feature which enables you to add multibyte awareness to such an application without code modification by overloading multibyte counterparts on the standard string functions.
For example, mb_substr() is called instead of substr() if function overloading is enabled.
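Note that the `mbstring.func_overload` setting behind this feature was deprecated in PHP 7.2 and removed in PHP 8.0, so on current versions the closest equivalent is to call the multibyte functions explicitly. A minimal sketch, assuming the mbstring extension is loaded; the helper names `char_length` and `char_substr` are hypothetical, not part of PHP:

```php
<?php
// Hypothetical helpers (not part of PHP) that fix the encoding to UTF-8,
// so call sites get character semantics without repeating the argument.

function char_length(string $text): int
{
    // Counts code points rather than bytes.
    return mb_strlen($text, 'UTF-8');
}

function char_substr(string $text, int $start, ?int $length = null): string
{
    // Offsets and lengths are measured in code points.
    return mb_substr($text, $start, $length, 'UTF-8');
}

$greeting = "h\u{e9}llo";                 // "héllo"

$chars  = char_length($greeting);         // 5
$middle = char_substr($greeting, 1, 3);   // "éll"
$broken = substr($greeting, 1, 3);        // "él": three bytes, not three characters
```

Unlike overloading, this leaves `strlen` and `substr` untouched for code that really does need byte semantics.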

Pascal MARTIN
  • Really? It was *cancelled?* Why in the world would anybody want to cancel something so important, now that virtually the entire web is Unicode? :( – tchrist Apr 23 '11 at 15:45
  • If there isn’t already a way to do it (I’d thought there already was), then I would also accept an answer that showed how to code up such a thing. – tchrist Apr 23 '11 at 15:46
  • It has *(even if I'm not sure if "canceled" is the right word, that's the basic idea)*; see http://news.php.net/php.internals/47120. As for how to code a thing such as "all in Unicode": well, that was the idea of PHP 6 -- and quite a lot of work... – Pascal MARTIN Apr 23 '11 at 15:47
  • If there is a `mbstring.func_overload`, it seems to me that there should be a way to make that work for other functions, too, including the `grapheme_*` ones. Similarly, the deprecated `mb_regex_set_options` seems to be just what is needed — save that it doesn’t include `/u` for `preg_*`. Why is it so much work? Is the problem that PHP’s module/extension mechanism isn’t rich enough to make such extensions natural and easy to write? Can’t you just add something to a table somewhere, especially in the first case? Thank you, Pascal. – tchrist Apr 23 '11 at 15:54
  • I suppose you could use some mechanism like the overloading used by mbstring to just redefine all PHP functions; but I'll let you imagine how much time it would take you to recode all those functions in C ;-) *(that was part of the PHP 6 development, btw: going over all PHP functions in order to make sure they would work with Unicode)* – Pascal MARTIN Apr 23 '11 at 15:55
  • Darn, I think I’ve found what I was dimly remembering. It was `unicode_semantics = On` from [this proposal](http://news.php.net/php.internals/17771). I guess that never happened, then. Did they really end up using nasty UTF-16 internally? I hope not: UTF-8 makes for smaller HTML or XML even for CJK, because of the density of ASCII mark-up. – tchrist Apr 23 '11 at 16:00
  • UTF-16 was what would have been used for PHP 6. – Pascal MARTIN Apr 23 '11 at 16:26
  • That would have been unfortunate: UTF-16 is a horrible mess: UTF-16 has all the disadvantages of UTF-8 and none of the advantages of UTF-8. It’s very error-prone, too. – tchrist Apr 23 '11 at 22:32
5

All byte functions (eg, strlen, strstr, strpos, and substr) work like the corresponding character functions (eg, mb_strlen, mb_strstr, mb_strpos, and mb_substr).

This isn't a good idea.

Unicode strings cannot transparently replace byte strings. Even when you are correctly handling all human-readable text as Unicode, there are still important uses for byte strings in handling file and network data that isn't character-based, and interacting with systems that explicitly use bytes.

For example, spit out a header 'Content-Length: '.strlen($imageblob) and you're going to get brokenness if that's suddenly using codepoint semantics.

You still need to have both mb_strlen and strlen, and you have to know which is the right one to use in each circumstance; there's not a single switch you can throw to automatically do the right thing.
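A minimal sketch of the Content-Length case, assuming the mbstring extension is loaded; the function name `content_length_header` and the sample body are illustrative only:

```php
<?php
// Content-Length counts bytes on the wire, so byte semantics are required
// here even when the body is human-readable UTF-8 text.

function content_length_header(string $body): string
{
    // strlen() returns the number of bytes, which is what HTTP expects.
    return 'Content-Length: ' . strlen($body);
}

$body = "h\u{e9}llo";                        // "héllo": 5 characters, 6 bytes

$header     = content_length_header($body);  // "Content-Length: 6"
$charCount  = mb_strlen($body, 'UTF-8');     // 5, which would truncate the response
$byteCount  = mb_strlen($body, '8bit');      // 6, a byte count via mbstring
```

If `strlen` silently switched to code point semantics, the header would announce 5 and the client would drop the last byte of the body.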

This is why IMO the approach of having a single string datatype that can be treated with byte or codepoint semantics is generally a mistake. Languages that provide separate datatypes for byte strings (with byte semantics), and character strings (with Unicode codepoint semantics(*)) tend to be more consistent.

(*: or UTF-16 code unit semantics if unlucky)

bobince
  • Perl seems to get by with `length` for everything. You just have to encode from internal logical characters to UTF-8 at times: `print "Content-Length: ", length(utf8_encode($payload))`. But those occasions are rare, so having the normal `strlen` be bytes not characters is a Huffman failure: the short thing should be the common thing. – tchrist Dec 15 '12 at 17:36