I know the web is mostly standardizing towards UTF-8 lately and I was just wondering if there was any place where using UTF-8 would be a bad thing.
There is an argument to be made that adding unnecessary conversions is adding complexity for little benefit. So if your inputs and your outputs use the same format then there is an argument for working in that format too.
Both UTF-8 and UTF-16 are relatively well-designed multi-unit encodings: the encoding of one character never appears as a sub-sequence of the encoding of another, and a decoder that detects an error can resume decoding at the next valid code unit.
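Here is a small Python sketch of both properties as they apply to UTF-8 (the sample strings are just illustrative):

```python
# 1. The encoding of one character never appears inside the encoding of
#    another, so searching for an ASCII byte such as '/' can never produce
#    a false hit inside the bytes of some multi-byte character.
path = "répertoire/fichier".encode("utf-8")
assert path.count(b"/") == 1

# 2. A decoder that hits an invalid byte can report it and resume at the
#    next valid code unit (0xFF never occurs in well-formed UTF-8).
corrupted = b"h\xff" + "éllo".encode("utf-8")
print(corrupted.decode("utf-8", errors="replace"))   # -> 'h�éllo'
```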
Some argue that UTF-32 is "better" because it uses one code unit for every Unicode code point. What makes this less compelling, though, is that there is no 1:1 mapping between Unicode code points and what most users would regard as "characters", so being able to rapidly get the nth code point from a sequence is less useful than it first appears.
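For example, in Python (where a `str` is a sequence of code points) a single user-perceived character can occupy several code points:

```python
s = "e\u0301"    # 'e' followed by COMBINING ACUTE ACCENT, rendered as 'é'
print(len(s))    # 2 -- two code points
print(s[0])      # 'e' -- "the nth code point" is not "the nth character"
```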
Also, what about in Windows programs, Linux shell and things of that nature -- can you safely use UTF-8 there?
Windows and Unix-like systems took different approaches to the introduction of Unicode. Both approaches had their pros and cons.
Windows introduced 16-bit Unicode (initially UCS-2, later UTF-16) by introducing a parallel set of APIs. Applications or frameworks that wanted Unicode support had to switch to the new APIs. This was further complicated by the fact that while Windows NT offered Unicode support in all APIs, Windows 9x only offered it in a subset.
On the filesystem side, Windows NT's native NTFS filesystem used 16-bit Unicode filenames from the start. For the FAT filesystem, which pre-dates Windows NT, Unicode was introduced as part of long filename support. Similarly for CDs, the Joliet extension added Unicode long filenames.
Unix-like systems, on the other hand, introduced Unicode by using UTF-8 and treating it like any other extended-ASCII character set. Filenames on Unix filesystems have always been sequences of bytes, where the meaning assigned to those bytes is down to the user's environment.
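A minimal Python sketch of that model: list a directory as raw bytes and only then apply a text interpretation (the fallback encoding here is just an assumption):

```python
import os

for raw in os.listdir(b"."):           # bytes in, bytes out on Unix
    try:
        name = raw.decode("utf-8")     # the usual case in a UTF-8 locale
    except UnicodeDecodeError:
        name = raw.decode("latin-1")   # assumed fallback for legacy names
    print(name)
```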
There are pros and cons to both approaches. The Unix approach allowed even non-Unicode-aware programs to handle Unicode text to some extent. On the other hand, it meant users essentially had to choose between a "Unicode" environment, where everything was UTF-8 and any pre-Unicode files needed conversion, and a "legacy" environment where Unicode was not supported.
Some programming languages or frameworks will attempt to settle on an encoding and convert everything to it. This is, however, complicated by the fact that on both Windows and Unix-like systems a program may encounter strings from the operating system that do not pass validation for their nominal encoding. This can happen for a number of reasons, including legacy data from pre-transition software, truncation that does not take account of multi-unit encodings, the use of nominally-text strings to pass non-text data, and plain old errors.
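One common mitigation, sketched here in Python, is a lossless escape mechanism such as `surrogateescape`: invalid bytes are smuggled through the decoded string and restored unchanged when re-encoding for the operating system (the filename is a made-up example):

```python
raw = b"report-\xe9-2003.txt"   # Latin-1 era name, not valid UTF-8

name = raw.decode("utf-8", errors="surrogateescape")
# `name` can be passed around as an ordinary str...
assert name.encode("utf-8", errors="surrogateescape") == raw   # ...and round-trips
```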