
Unicode is awesome. There aren't too many people who disagree with this.

Apart from Python 3 (which did it wrong), what would be the negative impact (if any) of the next major version of all programming languages defaulting to using Unicode/UTF-8 strings?

I'm talking specifically about the many cases which require workarounds to get UTF-8. For example, running a Java program:

java ... -Dfile.encoding=UTF-8

Or working with strings in Python 2:

# -*- coding: utf8 -*-
...
unicode_string = u"This is Unicode Text"

Certain MySQL installations default to a non-UTF-8 character encoding, which has to be overridden in the server configuration:

[server]
collation_server=utf8_unicode_ci
character_set_server=utf8

etc. etc.
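To make the stakes concrete, here's a small Python 3 sketch of what goes wrong when two defaults disagree (the sample string is arbitrary): text written out as UTF-8 but read back under a Latin-1 default turns into mojibake:

s = "café"
data = s.encode("utf-8")       # b'caf\xc3\xa9' on the wire
print(data.decode("latin-1"))  # prints 'cafÃ©' -- classic mojibake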

Why don't we all just default to using Unicode/UTF-8 and allow users to use the workarounds if they need support for other character encodings? What would be the problems with doing this?

Naftuli Kay
  • Why not add the Unicode convention of binary file == text file? You already ask for wild speculation. – Deduplicator Jul 02 '14 at 19:35
  • I'm not sure what you're asking, but I'm not dealing with raw bytes and binary data, I'm speaking specifically about strings of text. If someone wants to do raw binary stuff, use something like Java's `byte` type or C's `char`, which should definitely _not_ be Unicode. – Naftuli Kay Jul 02 '14 at 19:37
  • Recommended reading: http://stackoverflow.com/a/6163129/1607043 – DPenner1 Jul 03 '14 at 13:54

1 Answer


UTF-8 is a variable-length encoding, which makes random access by character position slower than in fixed-length encodings. Example: the 7th character of an ASCII string is always the 7th byte. We don't know exactly where the 7th character of a UTF-8 string is in memory without scanning from the beginning of the string and counting characters as we go. For long strings this can be expensive.
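A small Python 3 sketch of that mismatch (the string is just an example, and slicing-then-encoding stands in for the scan):

s = "naïve: résumé"
data = s.encode("utf-8")
print(len(s), len(data))             # 13 characters, 16 bytes
# There is no O(1) formula for the byte offset of character 7 (index 6);
# we effectively re-scan the prefix to find it:
offset = len(s[:6].encode("utf-8"))
print(offset)                        # 7, because 'ï' occupies two bytes

With a fixed-length encoding the byte offset would simply equal the character index.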

So for string operations where locating substrings by character or byte position matters (SQL databases are a great example of this), fixed-length encodings can often be preferable.

Additionally, UTF-8 encodes non-English text (anything outside the ASCII range) as two or more bytes per character, while many single-byte encodings (KOI8-R for Russian, for example) fit all of a language's commonly used characters into one byte each, which is handy for media such as email where all the data must be sent over the network.
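For instance, comparing the two with Python 3's built-in codecs (the sample word is arbitrary):

s = "Привет"                    # "Hello" in Russian
print(len(s.encode("koi8_r")))  # 6 bytes: one per character
print(len(s.encode("utf-8")))   # 12 bytes: two per character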

GB2312, a widely used Chinese character set, encodes the most common Chinese characters in two bytes each, while the same characters take three bytes in UTF-8 (a 50% increase).
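The same comparison with Python 3's built-in codecs (an arbitrary two-character sample):

s = "中文"
print(len(s.encode("gb2312")))  # 4 bytes: two per character
print(len(s.encode("utf-8")))   # 6 bytes: three per character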

UTF-8 is amazing for compatibility, but in terms of how it represents characters in memory, other encodings outcompete it in a lot of scenarios.

JessieArr
  • ***1.*** It **does not matter** where the n-th character is, you can iterate them fine anyway (what you described is *extremely rare*). ***2.*** Using an older (legacy in Microspeak) encoding is only useful if the data can carry unambiguous tags (not that BOM mess or other heuristics) for determining the actual encoding. (Well, email qualifies, if you don't use other languages and symbols.) ***3.*** The storage argument is a dead horse: if you want space efficiency, you'll compress and hardly see the difference. This is an especially bad argument for markup languages (or Chinese). – Deduplicator Jul 02 '14 at 19:20
  • 1- In a database entity with two strings of length N characters, if you don't use a fixed-length encoding then you have to allow the maximum number of bytes per character to accommodate a worst-case scenario (N * 4 for UTF-8). 2- Any medium which allows multiple character encodings will also allow metadata to specify which is being used, else it would be useless. 3- It depends on what compression is available to you. Chinese has over 8,000 unique characters, meaning two bytes each even in the best case. RLE compression doesn't work well for Chinese either, to my knowledge. – JessieArr Jul 02 '14 at 21:00
  • 1- You made the wrong decision for what defines the length of a text column: should it be glyphs, codepoints, or code units? Only the last one is sane. 2- Actually not. Sometimes you are expected to know (or guess right). Especially prevalent for files. 3- Use standard gzip and the like. Anyway, you forgot that dense text is rarely of relevant size in the few cases you'll find it. Backlink to what I said about markup, which is just the most outstanding example. – Deduplicator Jul 02 '14 at 21:17
  • UTF-16, used by Java and .NET, is a variable-length encoding too, so variable-length encodings are already used in major languages. On the other hand, many, many Java and .NET programs assume that it's not, and get away with it for users who constrain their Unicode usage to the BMP subset. – Tom Blodget Jul 03 '14 at 03:28