Why doesn't EVERYTHING default to UTF-8?

Question

I'm just curious why there are modern systems out there that default to something other than UTF-8. I've had a person blocked for an entire day by the multiple places where a MySQL system can have a different encoding. Very frustrating.

Is there any good reason not to use UTF-8 as a default (storage space doesn't seem like a good reason)? Not trying to be argumentative, just curious.

thx

Mostly because a lot of "modern" systems aren't actually modern (or at least, have been around for a while), and thus have to worry about backwards compatibility. — Amber, Jun 27 '12 at 03:39
So I'm not saying 'don't support' other encodings, but it seems like if everything were set to UTF-8, a lot of headaches could be avoided. I can imagine the pain of switching defaults (like in the case of MySQL), but I honestly kinda don't get it. — timpone, Jun 27 '12 at 03:44
Hmm.... so when I say systems I'm thinking of something like MySQL, or MySQL integrated with a web application. Seems like if everything is UTF-8, a LOT of headaches go away. I've known some smart people who've gotten bitten by wrong encodings. I'm not trying to be argumentative - just curious if there's a very good reason. — timpone, Jun 27 '12 at 03:55
UTF-8 has the disadvantage that not all characters are "the same size" (a single octet can hold another 128 code-page-specific characters beyond ASCII; past that, it's on to UTF-8 or UTF-16 or..). — , Jun 27 '12 at 03:56
seems like a pretty rare example. I don't know anyone who has to do that level of optimization. — timpone, Jun 27 '12 at 04:06
UTF-8 is the default on all of my systems. It has been for years and I haven't had any problems. My systems are Linux systems, btw. — Keith, Jun 27 '12 at 04:26
@timpone Yes, but you can make Python default to something else. You can edit /usr/lib/python2.7/site.py to do that. When I say "my systems" I mean my systems that I build. — Keith, Jun 27 '12 at 04:40
@pst: UTF-16 and UTF-32 make all sorts of sense now that memory is cheap, OTOH there is that byte order problem (which can be helped with BOMs but...) — mu is too short, Jun 27 '12 at 04:50
possible duplicate of [Why does modern Perl avoid UTF-8 by default?](http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default). Ok, it's talking about Perl, but the answer's the same. The accepted answer has 489 upvotes, and deserves every one. — Andrew Grimm, Jun 27 '12 at 04:55
I actually disagree with closing this question and the reference to the Perl question (my question wasn't about the internal working of a runtime / VM). In many ways, it shows how backwards computer programmers can be in terms of new ideas. — timpone, Jun 27 '12 at 13:15
@muistooshort True. I know it's usually a "premature optimization", but I generally stick with [VAR]CHAR and not N[VAR]CHAR... not very good for internationalization though :( — , Jun 27 '12 at 15:44
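
To make the "same size" and byte-order points in the comments above concrete, here is a minimal Python sketch. The sample characters and the helper name `encoded_sizes` are picked purely for illustration and don't come from the thread.

```python
# Sample characters chosen for illustration only:
# U+0041 (A), U+00E9 (e with acute), U+20AC (euro sign), U+10900 (a Phoenician letter)
samples = ["A", "\u00e9", "\u20ac", "\U00010900"]

def encoded_sizes(ch):
    """Return the byte length of ch in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le"))

for ch in samples:
    # UTF-8 width varies from 1 to 4 bytes; UTF-32 is always 4.
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))

# Byte order only matters when a code unit is wider than one byte,
# so UTF-16 has little- and big-endian forms while UTF-8 has a single form.
euro_le = "\u20ac".encode("utf-16-le")
euro_be = "\u20ac".encode("utf-16-be")
print(euro_le.hex(), euro_be.hex())
```

Note that UTF-16 isn't fixed-width either: characters outside the Basic Multilingual Plane (the last sample) take two 16-bit code units.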

score 6 · Answer 1 · answered Jun 27 '12 at 04:51

Once upon a time there was no Unicode or UTF-8, and disparate encoding schemes were in use throughout the world.

It wasn't until 1988 that the initial Unicode proposal was issued, with the goal of encoding all the world's characters in a common encoding.

The first release in 1991 covered many character representations; however, it wasn't until 2006 that Balinese, Cuneiform, N'Ko, Phags-pa, and Phoenician were added.

Until then the Phoenicians, and the others, were unable to represent their language in UTF-8, pissing off many programmers who wondered why everything was not just defaulting to UTF-8.

answered Jun 27 '12 at 04:51

monkut


1991 was 21 years ago, and with all due respect to the cultures you named, I doubt those are (or were, or will be in the foreseeable future) a large enough market for computers/software to block switching to a way more sensible (for the rest of the world) default for twenty years. That's a pretty weak reason. – Jun 27 '12 at 09:17
It takes time to migrate. Just because Unicode's first release was in 1991 doesn't mean it was instantly adopted; it still isn't fully adopted, which is why we all still have these encoding issues. A lot of existing data is still in encodings other than Unicode. – monkut Jun 27 '12 at 09:22
Then explain that, instead of rambling over something which is barely relevant to the answer you apparently intended to give. – Jun 27 '12 at 09:26
Where's the fun in that? – monkut Jun 27 '12 at 11:09
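
The migration problem mentioned in the comments above (existing data stored in one encoding and read back as another) can be sketched in a few lines of Python. The sample string and the choice of Latin-1 as the "legacy" encoding are assumptions made for illustration, not something taken from the thread.

```python
# Illustrative text only: "cafe" with an acute accent on the e.
original = "caf\u00e9"

# Stored as UTF-8 by one component...
stored = original.encode("utf-8")

# ...and read back by another component that assumes Latin-1.
# No error is raised; the text is silently corrupted (mojibake).
misread = stored.decode("latin-1")
print(misread)

# The damage is reversible only if you know which wrong guess was made.
recovered = misread.encode("latin-1").decode("utf-8")
print(recovered == original)

# A strict ASCII reader fails loudly instead of corrupting the text.
try:
    stored.decode("ascii")
except UnicodeDecodeError as exc:
    print("ASCII decode failed:", exc.reason)
```

This is roughly why mismatched defaults between layers (client, connection, table, application) are so painful: each layer can be "correct" on its own and the combination still garbles the data.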

score -1 · Answer 2 · answered Jun 27 '12 at 03:41

Some encodings have different byte orders (little and big endian).

answered Jun 27 '12 at 03:41

cmastudios


UTF-8 standardizes the byte ordering... How does this answer the question? – Eric J. Jun 27 '12 at 03:41
Because some systems might not use the standard byte ordering. You might need to use a different encoding to support that system's byte order. – cmastudios Jun 27 '12 at 03:44
http://unicode.org/faq/utf_bom.html#bom5 wrt BOM in UTF-8: ".. UTF-8 can contain a BOM. However, it makes no difference as to the endianness of the byte stream." So a BOM argument may be valid for UTF-16, but UTF-16 is generally not interchangeable with ASCII. – Jun 27 '12 at 03:52
UTF-16 is *not* interchangeable with ASCII, not 'generally' incompatible. ASCII is by definition 7-bit so one byte per character, period. – DaveE Jun 27 '12 at 04:57
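
The claims in this comment thread (UTF-8 is byte-compatible with ASCII, UTF-16 is not, and a BOM signals byte order only for UTF-16) can be checked with a small Python sketch. The sample string is arbitrary and only serves as an illustration.

```python
import codecs

text = "plain ASCII"  # arbitrary 7-bit sample

# Pure ASCII text produces identical bytes in ASCII and UTF-8.
same_bytes = text.encode("ascii") == text.encode("utf-8")
print(same_bytes)

# UTF-16 uses two bytes per character here, in one of two byte orders,
# so its output is not interchangeable with ASCII.
utf16_le = text.encode("utf-16-le")
utf16_be = text.encode("utf-16-be")
print(len(text), len(utf16_le), utf16_le != utf16_be)

# A UTF-8 BOM is a fixed three-byte signature; it says nothing about endianness.
utf8_with_bom = text.encode("utf-8-sig")
print(utf8_with_bom.startswith(codecs.BOM_UTF8))

# The UTF-16 BOM differs between byte orders, which is how a reader tells them apart.
print(codecs.BOM_UTF16_LE.hex(), codecs.BOM_UTF16_BE.hex())
```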
