
I have set my MySQL database's charset to utf8 with the utf8_turkish_ci collation, because I will store some Turkish characters in my project. I can enter Turkish characters properly and see them. But my problem is this:

For example, I define `username` as varchar(20), and the max length of the input box is 20. That means a user can't write any username longer than 20 characters. But when a user enters Turkish Unicode characters (like ş, i, ü, ğ), I get a "Data too long for column 'username'" error, because those Unicode characters are 2 bytes long!

I tried to update my database with phpMyAdmin, but changing the length brings even more errors. So do I have to drop all the tables and recreate them with twice the length? (I mean, if the data will be 20 characters, I define the column as varchar(40).) I have 30 tables, and that would be a nightmare. Is there any other way?
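For reference, the character-vs-byte arithmetic behind the error can be checked outside MySQL (Python here purely as a counting tool; the string is just an example, not real data):

```python
# Turkish letters are one character each, but two bytes each in UTF-8.
name = "şüğıöçşüğıö"  # 11 characters, all outside ASCII
print(len(name))                  # 11 characters
print(len(name.encode("utf-8")))  # 22 bytes -- over a 20-byte limit
```

So if something along the way enforces a 20-*byte* limit, 10 such characters fit (20 bytes) but 11 do not (22 bytes).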

Caravansary
  • https://stackoverflow.com/questions/279170/utf-8-all-the-way-through – Sammitch Jan 23 '18 at 22:52
  • Yes, the field is varchar(20) utf8_turkish_ci. I tried to enter 10 Unicode characters and there was no error, but when I try to enter 11 Unicode characters it causes the "too long" error. So I see that Unicode characters use 2 bytes, and I have to change my whole database :( – Caravansary Jan 23 '18 at 22:55
  • utf8_unicode_ci (utf8mb4_unicode_ci) is *such a thing*. More reading: https://stackoverflow.com/a/766996/2960971 https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-sets.html – IncredibleHat Jan 23 '18 at 23:00
  • 1
    Where are you seeing this "too long" error, and what is determining it? Is that an error you are actually getting out of mysql, or an error that php is presenting because its doing some sort of preliminary string count? – IncredibleHat Jan 23 '18 at 23:03
  • The error is coming out of MySQL: in the update code I put echo mysqli_error($bd), and that's where I got it. – Caravansary Jan 23 '18 at 23:08
  • You're correct, but I've [made a test](http://sqlfiddle.com/#!9/8aabec/4) and it works for me. I think IncredibleHat has a reasonable line of inquiry as to *what* is saying "too long", and I also have my suspicions that you may have encoding mismatches in the mix as well. – Sammitch Jan 23 '18 at 23:08
  • 1
    I think [this](https://adayinthelifeof.nl/2010/12/04/about-using-utf-8-fields-in-mysql/) will help you, `VARCHAR` and `CHAR` lenght definition considers only character lenght regardless bytes per character. Are you using InnoDb? you should take a look in [restrictions](https://dev.mysql.com/doc/refman/5.7/en/innodb-restrictions.html) – Alvaro Niño Jan 25 '18 at 11:38

2 Answers


MySQL will by default use 3 bytes to store any character for a VARCHAR specified as UTF8 (or 4 bytes for UTF8MB4).

VARCHAR(10) actually does mean 10 characters, 30 bytes. It doesn't mean 10 bytes.
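A quick way to see the character-vs-byte distinction (Python used here only to count UTF-8 bytes; this says nothing about InnoDB's on-disk layout):

```python
# A VARCHAR(10) utf8 column limits *characters*; MySQL's legacy "utf8"
# (utf8mb3) allows up to 3 bytes per character, so 10 chars <= 30 bytes.
sample = "ğ" * 10
assert len(sample) == 10                  # 10 characters
assert len(sample.encode("utf-8")) == 20  # ğ is 2 bytes in UTF-8
lira = "₺" * 10                           # U+20BA, a 3-byte BMP character
assert len(lira.encode("utf-8")) == 30    # the 3-bytes-per-char worst case
```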

Evert
  • 1
    No, it does not "use 3 bytes", it "uses _up to_ 3 bytes". English letters take 1 byte; Western European accented letters take 2; etc. The UTF-8 characters for Turkish are 1- or 2- bytes each, except for Lira. – Rick James Jan 24 '18 at 02:01
  • @RickJames you are correct when normally talking about UTF-8, but it's a common confusion that MySQL doesn't treat it this way. So when you specify a column as UTF-8, MySQL absolutely ends up storing it as 3 bytes per character in its underlying storage engine. – Evert Jan 24 '18 at 02:02
  • It's easy to test, but here's a reference: https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-utf8.html – Evert Jan 24 '18 at 02:03
  • As an aside, this is also why it might make sense in some cases (for storage reasons) to use VARBINARY instead. I tend to use this for text-fields that will only contain ASCII and I don't need to search in such as identifiers that end up in urls. – Evert Jan 24 '18 at 02:06
  • One way to think about it is : if a column in MySQL is marked as UTF-8, your mysql client will always see it as a sequence of UTF-8 bytes, but MySQL will still store it as a different fixed-width format. (Maybe UTF-24LE if that was a thing?) – Evert Jan 24 '18 at 02:07
  • With `col` being `CHARACTER SET utf8` (or utf8mb4), `SELECT HEX(col) ...` will demonstrate that MySQL stores English letters as 1 byte/character, Cyrillic as 2 bytes/character, etc. That is, MySQL is not expanding a UTF-8 character into a fixed number of bytes. – Rick James Jan 24 '18 at 02:13
  • 1
    And `VARCHAR(10)...utf8` is implemented as a length field (1 byte) and up to 30 bytes of text. If you store `'A'` into such a column, the space taken is 2 bytes (1 for the length, 1 for the character). Storing the Turkish `'ü'` would take 3 bytes (1 for length, 2 for hex `C3BC`, the utf8 encoding for "small letter u with diaeresis". – Rick James Jan 24 '18 at 02:18
  • `CHAR(10)...utf8`, on the other hand, does take 30 bytes (and no length field). (There is some dispute about whether newer versions of InnoDB turn that into something akin to `VARCHAR`, thereby saving some space. I see that the doc page is wrong here.) – Rick James Jan 24 '18 at 02:20
  • No, I believe you are wrong. Doing anything with MySQL queries is not going to prove this, because MySQL will convert any string from its internal storage format to real UTF-8 before calling a function like `HEX`. From a user's perspective it absolutely looks like UTF-8, just not if you look at the bits on your disk. – Evert Jan 24 '18 at 02:32
  • "One way to think about it" -- That is wrong. The `CHARACTER SET` provided for the column controls the storage in the table (1-4 bytes/character for utf8mb4). The _client_, on the other hand _may_ see a different encoding -- based on `SET NAMES` or equivalent. Best practice is to make them the same. But if different, MySQL will convert the encoding during the insert/select. – Rick James Jan 24 '18 at 02:32
  • 1
    that's an interesting theory. But it is wrong. Or do you have some form of 'proof'? – Rick James Jan 24 '18 at 02:44
  • Maybe run `strings` on some of your data files? – Evert Jan 24 '18 at 03:23
  • Or `od` or `hexdump`. First make sure you create the table with `innodb_file_per_table=ON`. – Rick James Jan 24 '18 at 14:52
  • Oh, you expect me to check? Idk... you're challenging my statement here, so I'm not sure I want to do your research for you. Nice try though. – Evert Jan 24 '18 at 17:37
  • Actually, I am throwing it out to the group – https://stackoverflow.com/questions/48426699/how-does-innodb-store-character-columns – Rick James Jan 24 '18 at 20:18
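As a side note, the `C3BC` figure mentioned above is easy to reproduce outside MySQL (this is plain UTF-8 encoding, not a claim about what the storage engine writes to disk):

```python
# UTF-8 encodes U+00FC ("small letter u with diaeresis") as bytes C3 BC.
encoded = "ü".encode("utf-8")
assert encoded.hex().upper() == "C3BC"
print(len(encoded))  # 2 data bytes; a VARCHAR adds 1 length byte on top
```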

I suspect your `<form>` needs to include the charset: `<form accept-charset="UTF-8">`
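One way such an encoding mismatch shows up (a sketch only; `cp1254` is Python's name for the legacy Windows Turkish code page, nothing MySQL-specific):

```python
# The same Turkish string has different byte lengths depending on the
# encoding the browser submits it in.
s = "şğü"
assert len(s.encode("cp1254")) == 3  # single-byte legacy Turkish encoding
assert len(s.encode("utf-8")) == 6   # two bytes per character in UTF-8
```

If the form posts bytes in one encoding and the server interprets them as another, both lengths and characters come out wrong.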

Rick James