What is the purpose of the MB_CASE_*_SIMPLE constants?

Question

According to the manual, the following constants have been added in PHP 7.3:

MB_CASE_FOLD
MB_CASE_LOWER_SIMPLE
MB_CASE_UPPER_SIMPLE
MB_CASE_TITLE_SIMPLE
MB_CASE_FOLD_SIMPLE

I found an example of what MB_CASE_FOLD does:

echo mb_convert_case('ẞ', MB_CASE_FOLD, 'UTF-8'); // ss

However, I could not find any reference to what the MB_CASE_*_SIMPLE constants do.

At first glance, with plain Latin-1 characters, MB_CASE_LOWER_SIMPLE behaves just like MB_CASE_LOWER.

How do the MB_CASE_*_SIMPLE constants differ from their MB_CASE_* counterparts?

Can't find too much good information on it but it looks like it makes a difference in some languages/glyphs. https://translate.google.com/translate?hl=en&sl=pl&u=https://geek.justjoin.it/nowego-php-v7-3-opisalismy-wszystkie-34-zmiany/ — smcd, Nov 14 '19 at 14:23

score 9 · Accepted Answer · edited Aug 24 '20 at 14:21

We can find the corresponding C implementation at https://github.com/php/php-src/blob/master/ext/mbstring/php_unicode.c#L223

And have a look at the git commit message:

Full case folding is implemented, but case-insensitive mb_* operations continue to use simple case folding. The reason is that full case folding of the haystack string may change the position at which a match occurred. This would have to be mapped back into the position in the original string.

mb_convert_case() exposes both the full and the simple case mapping / folding, where full is the default. The constants are:

MB_CASE_LOWER (used by mb_strtolower)

MB_CASE_UPPER (used by mb_strtoupper)

MB_CASE_TITLE

MB_CASE_FOLD

MB_CASE_LOWER_SIMPLE

MB_CASE_UPPER_SIMPLE

MB_CASE_TITLE_SIMPLE

MB_CASE_FOLD_SIMPLE (used by case-insensitive operations)

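So the constants with the _SIMPLE suffix use Unicode's simple case mapping and folding, and those without the suffix use full case mapping and folding.

The difference between the two: a simple mapping always replaces one code point with exactly one code point, while a full mapping may expand a single code point into several.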
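The position problem mentioned in the commit message can be sketched as follows. This is only an illustration, assuming PHP 7.3 or later with mbstring loaded; the sample string and variable names are made up for the example. It folds the haystack both ways and searches the folded copies with mb_strpos, rather than calling a case-insensitive function directly.

```php
<?php
// Illustrative sketch; the haystack and needle are hypothetical sample data.
$haystack = "Stra\u{00DF}e X"; // "Straße X", 8 code points
$needle   = 'x';

// Simple folding is one-to-one, so "ß" stays a single code point.
$simple = mb_convert_case($haystack, MB_CASE_FOLD_SIMPLE, 'UTF-8');
// Full folding expands "ß" to "ss", so the folded string grows by one.
$full   = mb_convert_case($haystack, MB_CASE_FOLD, 'UTF-8');

$posSimple = mb_strpos($simple, $needle, 0, 'UTF-8'); // offset is still valid for $haystack
$posFull   = mb_strpos($full, $needle, 0, 'UTF-8');   // offset is shifted by the extra "s"

var_dump($posSimple); // int(7)
var_dump($posFull);   // int(8)
```

The offset found in the fully folded copy no longer points at the "X" in the original string, which is the mapping-back problem the commit message describes.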

score 7 · Answer 2 · answered Nov 14 '19 at 15:01

Here are some examples where it matters:

MB_CASE_UPPER_SIMPLE:

mb_convert_case("ß", MB_CASE_UPPER_SIMPLE); // "ß"
mb_convert_case("ß", MB_CASE_UPPER); // "SS"

MB_CASE_LOWER_SIMPLE:

mb_convert_case("İ", MB_CASE_LOWER_SIMPLE); // "i"
mb_convert_case("İ", MB_CASE_LOWER); // "i\xcc\x87"

MB_CASE_TITLE_SIMPLE is similar to MB_CASE_UPPER_SIMPLE in the same way that MB_CASE_UPPER is similar to MB_CASE_TITLE.

score 0 · Answer 3 · answered Jul 03 '23 at 10:39

There are two kinds of case-mapping:

Simple case-mapping
Full case-mapping

Simple case-mapping is one-to-one character mapping, for example a single character "A" is replaced with another single character "a".

Full case-mapping performs one-to-many character replacements (more precisely, one-to-many code-point replacements). In real-world use cases, full case-mapping rarely makes a difference, because it only concerns a very small set of characters. For example, in German the letter "ß" is strictly lowercase and should be mapped to "SS" in uppercase words.

Source: https://jawira.github.io/case-converter/case-mapping.html
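To make the one-to-one versus one-to-many distinction concrete, here is a minimal sketch that counts code points before and after conversion. It assumes PHP 7.3 or later with mbstring; the variable names are only for illustration.

```php
<?php
// Illustrative sketch; variable names are hypothetical.
$sharpS = "\u{00DF}"; // "ß", a single code point

// Simple mapping: one code point in, one code point out ("ß" is left unchanged).
$upperSimple = mb_convert_case($sharpS, MB_CASE_UPPER_SIMPLE, 'UTF-8');
// Full mapping: one code point in, two code points out ("SS").
$upperFull   = mb_convert_case($sharpS, MB_CASE_UPPER, 'UTF-8');

printf("simple: %s (%d code point)\n", $upperSimple, mb_strlen($upperSimple, 'UTF-8'));
printf("full:   %s (%d code points)\n", $upperFull, mb_strlen($upperFull, 'UTF-8'));

// For a letter without a special mapping, such as "A", both modes agree.
$lowerSimple = mb_convert_case('A', MB_CASE_LOWER_SIMPLE, 'UTF-8');
$lowerFull   = mb_convert_case('A', MB_CASE_LOWER, 'UTF-8');
var_dump($lowerSimple === $lowerFull); // bool(true)
```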
