2

I have a multi-byte text of 156 characters encoded in UTF-8 format and verified by PHP function mb_strlen($text, 'UTF-8') to be of 156 length. I was expecting to be able to store all of it with VARCHAR(156). But a good portion of the text got truncated.

This is my original text:

위키백과, 백과사전.

대수(λ -, lambda -)는 함. 1930년대 다. 함수 s(x, y) = xx + 입력 x 것이다. x ↦ x 와 y ↦ y 는 변수의 이름은. 또한 (x, y) ↦ xx + yy 와 (u, v) ↦ uu + v*v 는.123456

This is what I got in MySQL:

위키백과, 백과사전.

대수(λ -, lambda -)는 함. 1930년대 다. 함수 s(x, y) = x*x + ìž…ë ¥ x 것ì´ë‹¤. x ↦ x 와 y ↦ y 는 변수ì

This is what is generated upon querying on my web page:

위키백과, 백과사전.

대수(λ -, lambda -)는 함. 1930년대 다. 함수 s(x, y) = x*x + 입력 x 것이다. x ↦ x 와 y ↦ y 는 변수�

There is a similar question on Stack Overflow, but it does not seem to address my question. Note that the table CHARSET=utf8 collation have been changed to UTF-8, General CI and column collation uses table default. I am using MySQL version 5.5.14 with system variables as shown:

+--------------------------+----------------------------------------+
| Variable_name            | Value                                  |
+--------------------------+----------------------------------------+
| character_set_client     | utf8                                   |
| character_set_connection | utf8                                   |
| character_set_database   | utf8                                   |
| character_set_filesystem | binary                                 |
| character_set_results    | utf8                                   |
| character_set_server     | utf8                                   |
| character_set_system     | utf8                                   |
| character_sets_dir       | /usr/local/mysql/share/mysql/charsets/ |
| collation_connection     | utf8_general_ci                        |
| collation_database       | utf8_general_ci                        |
| collation_server         | utf8_general_ci                        |
+--------------------------+----------------------------------------+

UPDATE:

After running mysqli_query($cxn, "SET NAMES utf8") on PHP script as suggested by Homer6, it did take in the full 156 characters and renders as per my original text.

But now what is generated on my web page becomes:

????, ????. ??(? -, lambda -)? ?. 1930?? ?. ?? s(x, y) = xx + ?? x ???. x ? x ? y ? y ? ??? ???. ?? (x, y) ? xx + yy ? (u, v) ? uu + v*v ?.123456

Can anyone help me?

Community
  • 1
  • 1
Question Overflow
  • 10,925
  • 18
  • 72
  • 110

3 Answers3

2

Can you try quadrupling the size to 624? I think the size is in bytes, not characters. And UTF-8 can be between 1 and 4 bytes.

See http://unicode.org/faq/utf_bom.html

Also, are you setting

SET NAMES 'utf8';

before you run your query?

Or, for Korean, what happens if you set

mysql_query( 'SET NAMES euckr_korean_ci' );

before your query?

http://dev.mysql.com/doc/refman/5.1/en/charset-asian-sets.html

Homer6
  • 15,034
  • 11
  • 61
  • 81
  • 1
    It says character, in the [manual](http://dev.mysql.com/doc/refman/5.5/en/char.html) – Question Overflow Nov 13 '11 at 06:50
  • 1 and 6 ? What's your source? – Homer6 Nov 13 '11 at 06:52
  • @pst "UTF-8 encodes each of the 1,112,064[7] code points in the Unicode character set using one to four 8-bit bytes" from http://en.wikipedia.org/wiki/UTF-8 – chacham15 Nov 13 '11 at 06:55
  • @ Homer, do you mean I have to `SET NAMES` every time I run a query. If yes, I haven't done so. If no, I have and system variables are showing all `UTF-8`. – Question Overflow Nov 13 '11 at 07:06
  • For each session, yes. Can you try it and see what the result is please? I think you can configure the server so a query before isn't required, if it works. – Homer6 Nov 13 '11 at 07:07
  • @ Homer, I tried `SET NAMES`, there is no change to the above. – Question Overflow Nov 13 '11 at 07:18
  • And what about quadrupling the varchar length? Perhaps the documentation is wrong? – Homer6 Nov 13 '11 at 07:19
  • @ Homer, I think what you suspect may be right. I did a `strlen` on the output string and it gives me 158, which is very close to 156. – Question Overflow Nov 13 '11 at 07:34
  • @ Homer, I tried `SET NAMES` on query editor which may be the reason why it doesn't work. After running `SET NAMES` on PHP script, it worked, but now I am seeing ??? on my web page. Please see my update above. Thanks. – Question Overflow Nov 13 '11 at 10:27
  • @ Homer, I have already set the system variables to UTF-8, why do I still need to use SET NAMES? What can I do to configure the server to make it a permanent feature? – Question Overflow Nov 13 '11 at 10:40
  • 1
    @Ben Huh: The problem is not with the server, but with the client. When the client connects, it chooses the default character set (defined in my.cnf ON THE CLIENT SIDE), which is by default latin1. You must always set the characters set after connecting to the server. See also http://www.php.net/manual/en/function.mysql-set-charset.php – Jakob Egger Nov 13 '11 at 11:14
  • @ Jakob Egger. Thanks. I just found that out too. Now what I did is to skip the client handshake requirement. Everything works smoothly now. – Question Overflow Nov 13 '11 at 11:40
1

It depends what version of MySQL you have. In MySQL 4 and earlier, the length is in bytes. In MySQL 5 and later, the length is in characters.

Also, the column needs to be set to utf8_unicode_ci for MySQL 5 to properly count the number of characters.

Abhi Beckert
  • 32,787
  • 12
  • 83
  • 110
0

Im pretty sure that mb_strlen returns the number of characters, not the size of the string.

Although UTF-8 is 1 byte per ascii character, this is not true for other languages/character sets. The number of characters until the 1930 is about 45. This makes sense because Korean characters take 3 bytes per character (i think)

You must also explicitly set the character set to utf8, see http://dev.mysql.com/doc/refman/5.0/en/string-type-overview.html

You can alter the table with: ALTER TABLE tbl_name CONVERT TO CHARACTER SET charset_name;

Run SHOW CREATE TABLE [TABLE_NAME]; to see what character set the column has. I.e. it should print out 'column_name' varchar(156) character set utf8 default NULL,

chacham15
  • 13,719
  • 26
  • 104
  • 207
  • the number 156 in VARCHAR(156) refers to character, not bytes. Therefore I am using `mb_strlen` instead of `strlen`. – Question Overflow Nov 13 '11 at 06:53
  • not generically true, some systems like microsofts sql server have a special data type for utf encoded strings (nvarchar). in mysql you have to explicitly set the character set to utf8. see http://dev.mysql.com/doc/refman/5.0/en/string-type-overview.html – chacham15 Nov 13 '11 at 06:57
  • I have stated above that all my encoding and collation related system variables and table collation have been changed to UTF-8, General CI. Is that not sufficient to set the character set in my database? – Question Overflow Nov 13 '11 at 07:03
  • I would check that it actually affected your table, because it sounds like it didn't. I.e. that you're table is actually encoded in utf8 – chacham15 Nov 13 '11 at 07:05
  • `SHOW CREATE` gives me `ENGINE=MyISAM AUTO_INCREMENT=1 DEFAULT CHARSET=utf8`, column follows table default. – Question Overflow Nov 13 '11 at 07:24
  • perhaps the table was altered after the column was created? – chacham15 Nov 13 '11 at 07:27
  • You mean a bug? The table was indeed altered from latin1 to utf8 after the column was created. – Question Overflow Nov 13 '11 at 07:37
  • there you go, defaults are applied only on creation. since the table was created with the latin1 default that is the storage type given to the column. – chacham15 Nov 13 '11 at 07:38
  • Yes, I notice that. I manually edit the column collation to utf8 such that the new default is applied. – Question Overflow Nov 13 '11 at 07:40
  • 1
    not the collation, the character set. try recreating the table (try it with explicitly setting the character set like above) – chacham15 Nov 13 '11 at 07:42
  • did you use: `varchar(156) character set utf8`? – chacham15 Nov 13 '11 at 08:52
  • I have explicitly set the table to `utf8` on create, not the column. The column should follow table default upon create as what you mentioned earlier. Are arguing now that I need to explicitly set the column as well? – Question Overflow Nov 13 '11 at 09:04
  • Im saying, try the most certain things to get something to work (including adding extra space as Homer suggested) , then slowly work backwards towards the original solution and see where things break. – chacham15 Nov 13 '11 at 09:27
  • Yes, I did attempt to set the character set in column to utf8, but it is already in default, so it doesn't set. I think Homer may be right that VARCHAR is counting bytes, not characters. – Question Overflow Nov 13 '11 at 09:34