Counting number of Unicode characters for MySQL database

Question

When, in PHP, I have variable that's hold Unicode characters do I need to worry about storing it in MySQL database?

Lets say I have variable value Dejan Čolić, I suppose that will for Unicode characters use 2 chars (or 4, I am not sure). Then my variable will be 13/17 chars long in database instead of 11 chars that are firstly specified. Am I right?

If I am right, is there some PHP function that will help count future Unicode characters so I could do proper data validation? Thanks.

An interesting post on Unicode in general: http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default/6163129#6163129 (best viewed on Firefox with the [Symbola](http://users.teilar.gr/~g1951d/) font installed) — NullUserException, Aug 30 '11 at 14:32
Unicode characters have variable length, so they could have 1 to 4 bytes (although UTF-8 allows up to 6 bytes/character) — NullUserException, Aug 30 '11 at 14:34

score 2 · Answer 1 · answered Aug 30 '11 at 14:31

2

PHP: mb_strlen()
MySQL: CHAR_LENGTH()

answered Aug 30 '11 at 14:31

Marc B

356,200
43
426
500

In my tests `mb_strlen()` and `strlen()` are giving same results. Is there any difference? – Dexy_Wolf Aug 30 '11 at 14:39
strlen() is dumb and counts bytes. mb_strlen is unicode-aware and counts characters. If you're working on an ascii string, then they're the same. Otherwise strlen will over-count. – Marc B Aug 30 '11 at 14:49
Well, then `strlen()` is actually what I need, because MySQL will store characters in bytes length (if there are some chars like `Č`, `ć` etc). Am I right? – Dexy_Wolf Aug 30 '11 at 14:57
1

MySQL'll store the data in whatever charset/collation you specify on the table. http://dev.mysql.com/doc/refman/5.1/en/charset-unicode.html http://dev.mysql.com/doc/refman/5.1/en/storage-requirements.html – Marc B Aug 30 '11 at 15:00

score 0 · Accepted Answer · answered Aug 30 '11 at 14:38

0

Characters are characters. 11 characters are 11 characters regardless of the environment. If MySQL doesn't know you are giving it UTF-8 encoded data, it may misinterpret the bytes and derive a different character count based on the assumption of a wrong encoding. But if MySQL is made aware of the correct encoding, it'll be 11 characters.

It'll be more bytes than that though, and even more bytes in MySQL since MySQL is storing UTF-8 data rather inefficiently.

answered Aug 30 '11 at 14:38

deceze

510,633
85
743
889

I my testings `Dejan Čolić` gives 21 and `Dejan Colic` 11? (With both `strlen()` and `mb_strlen()` on GET) – Dexy_Wolf Aug 30 '11 at 14:43
@Dejan Which encoding is the text in and which encoding does `mb_strlen` *think* the text is in? Try the second `$encoding` parameter of `mb_strlen`. – deceze Aug 30 '11 at 14:44
@Dejan BTW, reading this may help get a grip on the topic: http://kunststube.net/encoding/ – deceze Aug 30 '11 at 14:48
I made correction and added header. With `utf-8` there are results 13 and 11 (for `Dejan Čolić` and `Dejan Colic`). I suppose that `mb_strlen` think it is `utf-8`.. – Dexy_Wolf Aug 30 '11 at 14:49
1

`Č` = U+010C and `ć` = U+0107 = both are 2byte charts, so your strlen=13/mb_strlen=11 is perfectly accurate. – Marc B Aug 30 '11 at 15:01

Counting number of Unicode characters for MySQL database

2 Answers2