2

We currently have about 30 tables that are using latin1_sweedish_ci for their encodings. I realize though that utf8_unicode_ci is a much safer encoding to use, for which I would like to convert all of these tables over to.

My questions are,

  1. Can I convert all of these tables over to utf8_unicode_ci without causing any data corruption.
  2. Is there a copelling reason to do this if sweedish has been working fine up till this point?
  3. Is it okay to start using utf8_unicode_ci on newly created tables?
Metropolis
  • 6,542
  • 19
  • 56
  • 86
  • possible duplicate of [UTF-8 all the way through](http://stackoverflow.com/questions/279170/utf-8-all-the-way-through) – Pekka Aug 25 '11 at 19:58
  • 1
    That post does not answer the question on existing tables. So I do not see how this post should be closed. I do not have the option of starting from scratch, and that post does not answer that question. – Metropolis Aug 26 '11 at 13:32

1 Answers1

1

Can I convert all of these tables over to utf8_unicode_ci without causing any data corruption.

If you are working with pure mySQL (i.e. you have no web application actually using the tables), this is likely going to be fine. Switching table collations will convert the data as long as this is possible (it would not necessarily be if converting the other way round, from UTF-8 to latin1, because UTF-8's range is much broader and has many characters that latin1 can't display).

If one or more applications are accessing the tables, it wil all depend on them and whether they can fully deal with UTF-8 data. Many PHP applications, for example, can't.

Pekka
  • 442,112
  • 142
  • 972
  • 1,088
  • Hey Pekka! Thanks for the response. What if I take the application off line so that it is not accessing it while I convert all of them? – Metropolis Aug 25 '11 at 19:50
  • @Metropolis it doesn't matter, the issues (if there are any) come later. UTF-8 is a multi-byte encoding, with each character taking up 1 to 3 bytes. Latin1 is a single-byte encoding, where each character takes up exactly one byte. PHP's native string functions, for example, are not multi-byte safe, so there is a chance of the application breaking data later. Other languages like Python have native Unicode support, which causes a lot less problems. It really depends on your application. – Pekka Aug 25 '11 at 19:52
  • Huh.....So if I converted them now, the data would be okay....but as PHP used those tables in the future they would get messed up? I hope I am understanding you correctly. – Metropolis Aug 25 '11 at 19:55
  • @Metropolis they *might* get messed up if the application is using string operations that aren't multi-byte safe. This was the original idea of PHP 6: [Declaration to make PHP script completely Unicode-friendly](http://stackoverflow.com/q/5765093) but that has been cancelled. Pascal's answer there has links to the most common workarounds - it's not impossible to port an application to UTF8 at all, it just is a bit more work than just changing the tables. – Pekka Aug 25 '11 at 19:56
  • So basically, if you start a PHP application using mysql tables encoded with utf8_unicode_ci, you better make sure the string functions that you use are multi-byte supported throughout that application. Otherwise, if you convert them, you should go through your entire program and make sure what your using is supported. – Metropolis Aug 25 '11 at 19:58
  • @Metropolis exactly. See the duplicate link I posted above, it has a brilliant answer. – Pekka Aug 25 '11 at 19:59
  • 2
    @Pekka웃 Just "to be accurate" for people reading this thread of comments, UTF-8 character can go up to 4 bytes. This, btw, is a problem when using MySQL under 5.3 as it can't store them. With MySQL5.3 and higher, please use `utf8mb4_unicode_ci` instead. – jeromej Jul 28 '13 at 22:32