8

I'm curious why there are modern systems out there that default to something other than UTF-8. I've had a person blocked for an entire day by the multiple places where a MySQL system can have a different encoding set. Very frustrating.

Is there any good reason not to use UTF-8 as a default (storage space doesn't seem like a good one)? Not trying to be argumentative, just curious.

thx

timpone
  • Mostly because a lot of "modern" systems aren't actually modern (or at least, have been around for a while), and thus have to worry about backwards compatibility. – Amber Jun 27 '12 at 03:39
  • So I'm not saying 'don't support' other encodings, but it seems like if everything were set to UTF-8, a lot of headaches could be avoided. I can see the pain of switching defaults (as in the case of MySQL), but I honestly kinda don't get it. – timpone Jun 27 '12 at 03:44
  • Switching defaults is a *huge* pain. – Amber Jun 27 '12 at 03:48
  • Hmm... so when I say systems, I'm thinking of something like MySQL, or MySQL integrated with a web application. Seems like if everything is UTF-8, a LOT of headaches go away. I've known some smart people who've gotten bitten by wrong encodings. I'm not trying to be argumentative, just curious if there's a very good reason. – timpone Jun 27 '12 at 03:55
  • UTF-8 has the disadvantage that not all characters are "the same size" (a single octet has room for another 128 code-page-specific characters beyond ASCII, and past that it's on to UTF-8 or UTF-16 or ...). –  Jun 27 '12 at 03:56
  • seems like a pretty rare example. I don't know anyone who has to do that level of optimization. – timpone Jun 27 '12 at 04:06
  • This should really be in programmers.stackexchange.com – Burhan Khalid Jun 27 '12 at 04:25
  • UTF-8 is the default on all of my systems. It has been for years and I haven't had any problems. My systems are Linux systems, btw. – Keith Jun 27 '12 at 04:26
  • @timpone Yes, but you can make Python default to something else. You can edit /usr/lib/python2.7/site.py to do that. When I say "my systems" I mean my systems that I build. – Keith Jun 27 '12 at 04:40
  • @pst: UTF-16 and UTF-32 make all sorts of sense now that memory is cheap, OTOH there is that byte order problem (which can be helped with BOMs but...) – mu is too short Jun 27 '12 at 04:50
  • possible duplicate of [Why does modern Perl avoid UTF-8 by default?](http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default). Ok, it's talking about Perl, but the answer's the same. The accepted answer has 489 upvotes, and deserves every one. – Andrew Grimm Jun 27 '12 at 04:55
  • I actually disagree with closing this question and the reference to the Perl question (my question wasn't about the internal working of a runtime / VM). In many ways, it shows how backwards computer programmers can be in terms of new ideas. – timpone Jun 27 '12 at 13:15
  • @muistooshort True. I know it's usually a "premature optimization", but I generally stick with [VAR]CHAR and not N[VAR]CHAR... not very good for internationalization though :( –  Jun 27 '12 at 15:44
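
The size trade-off discussed in the comments above can be checked directly. This is a minimal sketch in Python 3; the sample characters and the helper names `encoded_sizes` and `fits_in_latin1` are assumptions chosen for illustration, not anything from the thread.

```python
# Illustrative characters (an assumption, not from the thread), written as
# escapes so the snippet does not depend on the source file's own encoding.
SAMPLES = {
    "LATIN CAPITAL A": "A",
    "E WITH ACUTE": "\u00e9",
    "EURO SIGN": "\u20ac",
    "GRINNING FACE": "\U0001F600",
}

ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_sizes(ch):
    """Return {encoding: number of bytes} for a single character."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


def fits_in_latin1(ch):
    """True if the legacy single-byte Latin-1 code page can store ch."""
    try:
        ch.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


# Print the per-encoding byte counts for each sample character.
for name, ch in SAMPLES.items():
    print(name, encoded_sizes(ch), "latin-1 ok:", fits_in_latin1(ch))
```

Of the three Unicode encodings shown, only UTF-32 is fixed width. UTF-16 needs four bytes (a surrogate pair) for characters outside the Basic Multilingual Plane, and a single-byte code page stays at one byte per character only by refusing everything outside its 256 slots.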

2 Answers

6

Once upon a time there was no Unicode or UTF-8, and disparate encoding schemes were in use throughout the world.

It wasn't until 1988 that the initial Unicode proposal was issued, with the goal of encoding all the world's characters in a common encoding.

The first release in 1991 covered many character representations; however, it wasn't until 2006 that Balinese, Cuneiform, N'Ko, Phags-pa, and Phoenician were added.

Until then the Phoenicians, and the others, were unable to represent their languages in UTF-8, pissing off many programmers who wondered why everything was not just defaulting to UTF-8.

monkut
  • 1991 was 21 years ago, and with all due respect to the cultures you named, I doubt those are (or were, or will be in the foreseeable future) a large enough market for computers/software to block switching to a way more sensible (for the rest of the world) default for twenty years. That's a pretty weak reason. –  Jun 27 '12 at 09:17
  • It takes time to migrate. Just because Unicode's first release was in 1991 doesn't mean it was instantly adopted, and it still isn't fully, which is why we all still have these encoding issues. A lot of existing data is still in encodings other than Unicode. – monkut Jun 27 '12 at 09:22
  • Then explain that, instead of rambling over something which is barely relevant to the answer you apparently intended to give. –  Jun 27 '12 at 09:26
  • Where's the fun in that? – monkut Jun 27 '12 at 11:09
-1

Some encodings have different byte orders (little- and big-endian).

cmastudios
  • UTF-8 standardizes the byte ordering... How does this answer the question? – Eric J. Jun 27 '12 at 03:41
  • Because some systems might not use the standard byte ordering. You might need to use a different encoding to support that system's byte order. – cmastudios Jun 27 '12 at 03:44
  • http://unicode.org/faq/utf_bom.html#bom5 wrt BOM in UTF-8: ".. UTF-8 can contain a BOM. However, it makes no difference as to the endianness of the byte stream." So a BOM argument may be valid for UTF-16, but UTF-16 is generally not interchangeable with ASCII. –  Jun 27 '12 at 03:52
  • UTF-16 is *not* interchangeable with ASCII, rather than just 'generally' not. ASCII is by definition 7-bit, so one byte per character, period. – DaveE Jun 27 '12 at 04:57
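
The byte-order points raised in these comments can be seen by encoding the same text several ways. This is a small sketch in Python 3 using only the standard `codecs` module; the sample string and the helper names `encodings_of` and `strip_utf8_bom` are assumptions made up for illustration.

```python
import codecs

TEXT = "Az"  # plain ASCII sample, chosen for illustration


def encodings_of(text):
    """Encode text in the encodings discussed in the comments above."""
    return {
        "ascii": text.encode("ascii"),
        "utf-8": text.encode("utf-8"),
        "utf-16-le": text.encode("utf-16-le"),
        "utf-16-be": text.encode("utf-16-be"),
    }


def strip_utf8_bom(data):
    """Decode UTF-8 bytes, dropping a leading BOM if one is present."""
    return data.decode("utf-8-sig")


encoded = encodings_of(TEXT)
for name, data in encoded.items():
    print(name, data)

# The UTF-8 BOM is a fixed three-byte signature; unlike the two UTF-16 BOMs,
# it carries no byte-order information.
print(codecs.BOM_UTF8, codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
print(strip_utf8_bom(codecs.BOM_UTF8 + encoded["utf-8"]))
```

For ASCII-only text the UTF-8 bytes are identical to the ASCII bytes on any machine, while the two UTF-16 variants differ from ASCII and from each other, which is the only place endianness enters the picture.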