Does a reliable way to capitalize Unicode text exist?

Question

I recently had to deal with some complex problems while working with Unicode strings (using PHP, a language I know pretty well). The mbstring extension was not working properly, and we had a lot of trouble capitalizing Unicode letters, which with ASCII text is a trivial problem, already solved in a variety of ways.

If I had to solve this problem with ASCII text, I would probably just take the character, check whether it is a lowercase letter and then subtract 32 from its ASCII value, for example! So far, though, I could not find anything explaining how capitalization of Unicode text has been solved: do I need to store a complete associative table that maps every lowercase character to its uppercase version? I suppose (and hope) I will hear a huge NO!
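For reference, the ASCII trick described above can be sketched as follows. Python is used here only for illustration; the same arithmetic works in PHP with `ord()` and `chr()`. The function name `ascii_upper` is made up for this example.

```python
def ascii_upper(text: str) -> str:
    """Uppercase only the ASCII letters a-z; leave everything else untouched."""
    out = []
    for ch in text:
        if "a" <= ch <= "z":
            # 'a' is 97 and 'A' is 65, so the distance is always 32.
            out.append(chr(ord(ch) - 32))
        else:
            out.append(ch)
    return "".join(out)


print(ascii_upper("hello, World 42"))  # HELLO, WORLD 42
print(ascii_upper("café"))             # CAFé
```

The second call shows where the trick stops working: `é` is outside the a-z range, so it is left as it is.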

The heart of the question: does any method exist to correctly convert lowercase into uppercase (and back) when operating on Unicode characters? And if so, which strategies are applied?

For this question, suppose you have no module available at all: no mbstring, no iconv, nothing. Moreover, for the sake of simplicity, suppose the problem of recognizing individual characters is already solved: our String object has a nextChar() method that returns the next character regardless of its byte length. What you want to do is take a string, iterate over it with nextChar() and capitalize each character where possible.

If anything is unclear or you need more information, just comment; I will try to answer your doubts, if they are not even bigger than mine at the moment ;)

I think this is indeed done with a table, *and* it is even worse because that table depends on the language of the text. An example is how in Turkish the upper case version of `i` is the dotted capital `İ`. Practically speaking, I think the only way to do this is to find a library which can do it for you. – roeland, Jun 29 '16 at 21:55
You might be right, I noticed that it is extremely difficult even to find fixed rules! If we take the Greek alphabet (U+0391 to U+03A9 for the uppercase letters) we can follow an "add 32" (0x20) rule with the exception of U+03A2, but if we go past it and move to the Coptic letters, they seem to follow a sort of "add 1" rule. I guess Unicode will keep causing problems until we find a smarter way to handle it! – PoPeio, Jun 29 '16 at 22:06
Well, you have to know what you're doing. Moving text around is easy enough, but you have to be careful to know what encoding that text is in. If you start doing things like sorting, capitalisation, truncating, etc., you should just find a library. There's for instance a [wrapper for ICU](http://us2.php.net/manual/en/book.intl.php). – roeland, Jun 29 '16 at 23:10
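
To make the two points from these comments concrete, here is a small sketch in Python (chosen only for illustration). It relies on Python's built-in `str.upper()`, which applies the Unicode default case mappings and knows nothing about the language of the text. The helper names `case_offset` and `upper_for_language` and the override table are assumptions made up for this example, not part of any library.

```python
def case_offset(lower: str) -> int:
    """Distance between a lowercase letter and its one-character uppercase form."""
    upper = lower.upper()
    if len(upper) != 1:
        # Some letters uppercase to several characters, e.g. 'ß' -> 'SS'.
        raise ValueError(f"{lower!r} uppercases to {len(upper)} characters")
    return ord(lower) - ord(upper)


# Hypothetical per-language override: Turkish and Azerbaijani map 'i' to
# the dotted capital U+0130 instead of plain 'I'.
TURKIC_OVERRIDES = {"i": "\u0130"}


def upper_for_language(text: str, lang: str = "") -> str:
    """Uppercase text, applying the dotted-I rule for 'tr' and 'az'."""
    if lang in ("tr", "az"):
        return "".join(TURKIC_OVERRIDES.get(ch, ch.upper()) for ch in text)
    return text.upper()


print(case_offset("a"))   # 32
print(case_offset("α"))   # 32 (U+03B1 -> U+0391)
print(case_offset("ς"))   # 31 (final sigma U+03C2 -> U+03A3)
print(case_offset("ϣ"))   # 1  (U+03E3 -> U+03E2)
print(upper_for_language("istanbul"))        # ISTANBUL
print(upper_for_language("istanbul", "tr"))  # İSTANBUL
```

The offsets differ between and even within scripts, and the same input gives different results depending on the language, which is why no single arithmetic rule covers Unicode.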

score 1 · Accepted Answer · edited May 23 '17 at 11:52

You can try the PortableUTF8 library, written as an alternative to mbstring and iconv.

http://pageconfig.com/post/portable-utf8

Another interesting library is Stringy. By default it works with mbstring, but if the module is not available it falls back to a polyfill package.

https://github.com/danielstjules/Stringy

To understand the problem better, it is also worth reading:

What factors make PHP Unicode-incompatible?

I hope it will be useful for you.

answered Jun 29 '16 at 21:31

manuelbcd

Thank you for the suggestion, but if you open the link (http://pageconfig.com/attachments/portable-utf8.php) and look for `function utf8_case_table( )` you will see it uses a mapping table, and my question is all about methods to avoid that! I am not asking for something I can copy->paste->deploy; I hope to understand more about the available strategies. – PoPeio Jun 29 '16 at 21:49
Hi, I didn't realize portable-utf8 was using a map table. The problem with PHP and Unicode is very complex; in fact, PHP 6 was abandoned in part due to Unicode problems. I'm sure you would be able to implement it. You can dig deeper into the mbstring extension source code. It's written in C, but building your own version could take a large amount of time. https://github.com/php/php-src/blob/master/ext/mbstring/php_unicode.c – manuelbcd Jun 29 '16 at 22:08
Absolutely! That's why I am not going to reinvent the wheel :P I was just trying to better understand how the existing wheels work. By the way, the `Stringy` stuff is quite interesting. I just find the support for `foreach` useless: I am quite sure it will take individual bytes, not characters. Still, thanks for sharing! – PoPeio Jun 29 '16 at 22:12
https://github.com/php/php-src/blob/master/ext/mbstring/unicode_data.h#LC2474 I guess there was no straightforward way to avoid it... – PoPeio Jun 29 '16 at 22:22
I have reached the same conclusion. Please let us know what you end up doing. Good luck! – manuelbcd Jun 30 '16 at 04:59
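
Since the thread concludes that a table cannot be avoided, one way to keep such a table small is to store ranges of code points that share the same offset, plus a short list of exceptions, instead of one entry per character. The sketch below (Python, for illustration only) covers just a handful of ranges picked for this example. The layout, the names `RANGES`, `EXCEPTIONS`, `simple_upper` and `upper_with_ranges`, and the choice of ranges are all assumptions; this is not how mbstring or portable-utf8 actually organize their data, and it ignores language-specific rules and multi-character results such as `ß` to `SS`.

```python
from bisect import bisect_right

# (first_lowercase, last_lowercase, step, delta), sorted by first_lowercase.
# step=2 describes blocks where capital and small letters alternate.
RANGES = [
    (0x0061, 0x007A, 1, -32),  # Latin a-z
    (0x03B1, 0x03C1, 1, -32),  # Greek alpha to rho
    (0x03C3, 0x03C9, 1, -32),  # Greek sigma to omega
    (0x03E3, 0x03EF, 2, -1),   # Coptic letters in the Greek block
    (0x0430, 0x044F, 1, -32),  # Cyrillic a to ya
]
STARTS = [first for first, _, _, _ in RANGES]

# Letters that do not fit any range rule.
EXCEPTIONS = {0x03C2: 0x03A3}  # Greek final sigma -> capital sigma


def simple_upper(ch: str) -> str:
    """Uppercase one character using the range table; unknown input is returned as is."""
    cp = ord(ch)
    if cp in EXCEPTIONS:
        return chr(EXCEPTIONS[cp])
    i = bisect_right(STARTS, cp) - 1  # last range starting at or before cp
    if i >= 0:
        first, last, step, delta = RANGES[i]
        if cp <= last and (cp - first) % step == 0:
            return chr(cp + delta)
    return ch


def upper_with_ranges(text: str) -> str:
    return "".join(simple_upper(ch) for ch in text)


print(upper_with_ranges("αβγ ςσ ϣ abc я"))  # ΑΒΓ ΣΣ Ϣ ABC Я
```

Five tuples and one exception replace about a hundred individual pairs here. A complete implementation would still need the full Unicode case data as its source, so this only compresses the table rather than removing it.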
