2

I am using char_length() to measure the size of "Русский": strangely, instead of telling me that it's 7 chars, it tells me there are 14. Interestingly if the query is simply...

SELECT CHAR_LENGTH('Русский')

...the answer is correct. However if I query the DB instead, the anser is 14:

SELECT CHAR_LENGTH(text) FROM locales WHERE lang = 'ru-RU' AND name = 'lang_name'

Anybody go any ideas what I might be doing wrong? I can confirm that the collation is utf8_general_ci and the table is MyISAM

Thanks, Adrien

EDIT: My end objective is to be able to measure the lengths of records in a table containing single and double-byte chracters (eg. English & Russian, but not limited to these two languages only)

Adrien Hingert
  • 1,416
  • 5
  • 26
  • 51

2 Answers2

2

Because of two bytes is used for each UTF8 char. See http://dev.mysql.com/doc/refman/5.5/en/string-functions.html#function_char-length

mysql> set names utf8;
mysql> SELECT CHAR_LENGTH('Русский'); result - 7
mysql> SELECT CHAR_LENGTH('test'); result  - 4

create table test123 (
text VARCHAR(255) NOT NULL DEFAULT '',
text_text TEXT) Engine=Innodb default charset=UTF8;

insert into test123 VALUES('русский','test русский');

SELECT CHAR_LENGTH(text),CHAR_LENGTH(text_text) from test123; result - 7 and 12

I have tested work with: set names koi8r; create table and so on and got invalid result. So the solution is recreate table and insert all data after setting set names UTF8.

Sébastien
  • 11,860
  • 11
  • 58
  • 78
Andrej
  • 7,474
  • 1
  • 19
  • 21
  • I'm not sure I follow what you mean to say – Adrien Hingert Aug 17 '11 at 17:47
  • I meant that this one in UTF8 and you should set UTF8. Try to run 'set names utf8;' – Andrej Aug 17 '11 at 17:58
  • mysql> set names utf8; Query OK, 0 rows affected (0.00 sec) mysql> SELECT CHAR_LENGTH('Русский'); +-------------------------------+ | CHAR_LENGTH('Русский') | +-------------------------------+ | 7 | +-------------------------------+ 1 row in set (0.00 sec) mysql> SELECT CHAR_LENGTH('test'); +---------------------+ | CHAR_LENGTH('test') | +---------------------+ | 4 | +---------------------+ 1 row in set (0.00 sec) – Andrej Aug 17 '11 at 18:00
  • 1
    I have tested work with: set names koi8r; create table and so on and got invalid result. So the solution is recreate table and insert all data after setting set names UTF8. Try this solution. – Andrej Aug 17 '11 at 18:30
  • I have tried as you suggested:drop table, create table, set names, insert data, but I still get the same error. On the other hand, using your test123 table everything works as expected – Adrien Hingert Aug 18 '11 at 00:59
  • I have tried extracting the same information using PHP's mb_strlen() and the result is still 14. Yet when I try to output the Russian text it shows correctly – Adrien Hingert Aug 18 '11 at 01:28
  • Let me correct my last statement: using mb_strlen($row['text'], "utf-8") will give the correct answer – Adrien Hingert Aug 18 '11 at 01:31
  • So you should set names UTF8 at first and than create a new table and insert data but not create table/set names UTF8/insert data. – Andrej Aug 18 '11 at 07:14
  • Could it be that the actual data, having been entered in the DB through a webform, is somehow encoded in a weird way? – Adrien Hingert Aug 18 '11 at 11:13
  • Yes you should always after connecting to db send the first query 'set names UTF8;' and then run any queries. All should work correctly. – Andrej Aug 18 '11 at 11:19
1

the function return it's anwser guided by the most adjacent charset avaiable
in the case of a column, the column definition
in the case of a literal, the connection default
review the column charset with:

SELECT CHARACTER_SET_NAME FROM information_schema.`COLUMNS` 
where table_name = 'locales'
and column_name = 'text'

be careful, it is not filtered by table_schema

Saic Siquot
  • 6,513
  • 5
  • 34
  • 56