66

Does anyone know why latin1_swedish_ci is the default for MySQL? It would seem to me that UTF-8 would be more compatible, right?

Defaults are usually chosen because they are the best universal choice, but in this case that does not seem to be what they did.

Metropolis
  • 8
    Good question! MySQL AB is (or used to be) a Swedish company, so that's probably the reason for the Swedish part... As to why latin1, I don't know. – Pekka Oct 14 '10 at 17:58
  • @Pekka +1 Ah.....that is interesting. I did not know that. – Metropolis Oct 14 '10 at 17:59
  • Possible duplicate of [Why is MySQL's default collation latin1\_swedish\_ci?](http://stackoverflow.com/questions/6769901/why-is-mysqls-default-collation-latin1-swedish-ci) – Jeff Puckett Jun 14 '16 at 03:39
  • 2
    @JeffPuckettII Except this one was asked first. So that one is a duplicate. – Metropolis Jun 22 '16 at 14:54
  • @Metropolis I'm glad you mentioned that because it was the reason I found this answer: http://meta.stackexchange.com/a/147651/321521 – Jeff Puckett Jun 22 '16 at 15:00
  • @JeffPuckettII Interesting. So what if both have good answers? Seems like it would not always be clear cut on what a better answer is. In this case they may both have good answers to different people. Would be nice if they could be merged somehow. – Metropolis Jun 22 '16 at 15:55
  • @JeffPuckettII Ideally, if a question was asked first, then right when the newer question gets asked, it would be flagged as duplicate before any more questions or answers are added to the newer one. Which would always bring everyone back to the original. – Metropolis Jun 22 '16 at 15:58
  • @Metropolis if you read that answer again, you'll see *"You can flag and ask a moderator to merge after closure if they're exactly the same."* – Jeff Puckett Jun 22 '16 at 16:01
  • @Metropolis ideally, yes, the newer question should have been flagged before it even got an answer, but it didn't, so the duplicate catching system is not good enough yet. – Jeff Puckett Jun 22 '16 at 16:02

5 Answers

49

As far as I can see, latin1 was the default character set in pre-multibyte times, and it looks like that's been continued, probably for reasons of backward compatibility (e.g. for older CREATE statements that didn't specify a collation).

From here:

What 4.0 Did

MySQL 4.0 (and earlier versions) only supported what amounted to a combined notion of the character set and collation with single-byte character encodings, which was specified at the server level. The default was latin1, which corresponds to a character set of latin1 and collation of latin1_swedish_ci in MySQL 4.1.

As to why Swedish, I can only guess that it's because MySQL AB is/was Swedish. I can't see any other reason for choosing this collation: it comes with some specific sorting quirks (ÄÖÜ come after Z, I think), and those are nowhere near an international standard.

informatik01
Pekka
  • 2
    I think they maybe chose this rather odd collation to make it obvious to the user that it should be changed, which of course most of the time did not turn out as expected but was prevented by the tyranny of the default :) – The Surrican Apr 19 '13 at 09:42
  • 2
    @TheSurrican, What a strange answer. What makes this an odd collation? It's the Swedish version of standard latin1 chosen by a Swedish company. It's just like Oracle choosing US English for their products. – chrismacp Feb 20 '16 at 14:54
  • 2
    How about latin1_swedish_ci being ISO 8859-1 and ISO 8859-1 is the first of the available choices when sorted, so if you don't specify any choice, the – zeachco Sep 26 '16 at 16:03
6

latin1 is the default character set. MySQL's latin1 is the same as the Windows cp1252 character set. This means it is the same as the official ISO 8859-1 or IANA (Internet Assigned Numbers Authority) latin1, except that IANA latin1 treats the code points between 0x80 and 0x9f as “undefined,” whereas cp1252, and therefore MySQL's latin1, assign characters for those positions.

from

http://dev.mysql.com/doc/refman/5.0/en/charset-we-sets.html

Might help you understand why.
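
The difference in the 0x80 to 0x9f range can be seen with a small sketch. It uses Python's built-in `cp1252` and `latin-1` codecs as stand-ins for the two character sets; these are Python's implementations, not MySQL's, so treat it as an illustration of the idea only.

```python
# Decode the same byte under both encodings, using Python's built-in codecs.
raw = bytes([0x80])

as_cp1252 = raw.decode("cp1252")   # Windows-1252 assigns the euro sign here
as_latin1 = raw.decode("latin-1")  # Python's ISO 8859-1 codec yields a C1 control code

print(repr(as_cp1252))
print(repr(as_latin1))

# Outside the 0x80-0x9f range the two agree, e.g. 0xE4 is "ä" in both.
same_outside_range = bytes([0xE4]).decode("cp1252") == bytes([0xE4]).decode("latin-1")
print(same_outside_range)
```

So text containing characters such as the euro sign only round-trips if both sides agree that "latin1" means cp1252.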

cameck
bear
  • 4
    Yeah, but the question is why is this the default character set and not the incredibly more versatile UTF-8? – Pekka Oct 14 '10 at 18:00
  • I know what his question was. I can only suggest that there were limitations or it wasn't used widely, or was somewhat not as popular at the time. – bear Oct 14 '10 at 18:08
  • 1
    @Pekka웃 That's because as wonderful as UTF-8 is it is still **multi-byte** and worse variable length multi-byte. And that's a death-knell for extremely simplistic programs. I don't think anyone ever woke up in a cold-sweat worrying about 5 and 7 byte `latin1` characters. Of course this may only apply to the past. *was* not *is*... – ebyrob Jul 31 '17 at 16:23
  • @ebyrob true - but arguably those days are so far past that *they* should be the special case, rather than UTF-8 which these days, is the household encoding for new projects. – Pekka Jul 31 '17 at 19:33
  • 1
    @Pekka웃 Unfortunately I kind of understand Oracle's lack of any forward progress in MySQL globally. I'm a bit dumbfounded however by MariaDB not making the switch, though they do feature it prominently in their documentation: https://mariadb.com/kb/en/mariadb/setting-character-sets-and-collations/#example-changing-the-default-character-set-to-utf-8 – ebyrob Jul 31 '17 at 20:24
2

Using a single-byte encoding has some advantages over multi-byte encodings, e.g. the length of a string in bytes is equal to the length of that string in characters. With a multi-byte encoding, if you use functions like SUBSTRING, it is not intuitively clear whether you mean characters or bytes. Also, for the same reasons, supporting multi-byte encodings requires quite a big change to the internal code.
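
A minimal sketch of the bytes-versus-characters point, in Python rather than SQL (the sample word is just an assumption for illustration):

```python
word = "Malmö"  # 5 characters

latin1_bytes = word.encode("latin-1")
utf8_bytes = word.encode("utf-8")

char_count = len(word)
latin1_len = len(latin1_bytes)  # one byte per character
utf8_len = len(utf8_bytes)      # "ö" takes two bytes in UTF-8

# Taking "the first 5" is unambiguous for latin1: bytes and characters line up.
first5_latin1 = latin1_bytes[:5].decode("latin-1")

# For UTF-8, cutting at 5 bytes splits "ö" in half and no longer decodes.
try:
    utf8_bytes[:5].decode("utf-8")
    byte_cut_is_valid = True
except UnicodeDecodeError:
    byte_cut_is_valid = False

print(char_count, latin1_len, utf8_len, first5_latin1, byte_cut_is_valid)
```

Code that assumes one byte per character keeps working with latin1 but has to be rewritten to be safe with UTF-8.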

AndreKR
0

Most strange features of this kind are historic. They did it like that a long time ago, and now they can't change it without breaking some app that depends on that behavior.

Perhaps UTF-8 wasn't popular then. Or perhaps MySQL didn't support charsets where multiple bytes encode one character then.

CodesInChaos
0

To expand on why not utf8, and to explain a gotcha not mentioned elsewhere in this thread: MySQL's utf8 is not real UTF-8! MySQL has been around for a long time, since before UTF-8 was in common use. As explained above, this is likely why it is not the default (backwards compatibility, and the expectations of 3rd party software).

At the time when UTF-8 was new and not commonly used, it seems the MySQL devs added basic utf8 support that stores at most 3 bytes per character, which is not enough for all of UTF-8. Now that it exists, they have chosen not to increase it to 4 bytes or remove it. Instead they added utf8mb4 ("multi-byte 4"), which is real UTF-8 with up to 4 bytes per character.

It's important that anyone migrating a MySQL database to utf8, or building a new one, knows to use utf8mb4. For more information see https://adamhooper.medium.com/in-mysql-never-use-utf8-use-utf8mb4-11761243e434
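
To see which text is affected, here is a rough sketch in Python. The helper name `fits_in_three_bytes` is made up for this example and it does not talk to MySQL at all; it only checks how many bytes each character needs in UTF-8, on the assumption that the 3-byte utf8 charset cannot hold anything longer.

```python
def fits_in_three_bytes(text: str) -> bool:
    """Return True if every character needs at most 3 bytes in UTF-8.

    Hypothetical helper: a stand-in for "would this fit in MySQL's
    3-byte utf8", not an actual MySQL check.
    """
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)


# Bytes needed per character for a few samples.
sizes = {ch: len(ch.encode("utf-8")) for ch in ["a", "ö", "€", "😀"]}
print(sizes)

print(fits_in_three_bytes("Malmö costs 5 €"))  # accented letters and "€" fit
print(fits_in_three_bytes("thanks 😀"))         # the emoji needs 4 bytes
```

Anything that fails this check (emoji, some rarer CJK characters, and so on) is the kind of data that needs utf8mb4.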

antus