9

Most programming languages have some support for Unicode, but all have more or less documented corner cases where things don't work correctly.


Examples

Java: reverse() in StringBuilder/StringBuffer works correctly, but length(), charAt(), etc. in String do not if a character needs more than 16 bits to encode.
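
To illustrate the Java case (a minimal sketch; U+1D49C is just an arbitrary character outside the BMP):

```java
public class SupplementaryDemo {
    public static void main(String[] args) {
        String s = "a\uD835\uDC9C"; // 'a' followed by U+1D49C (a surrogate pair in UTF-16)
        System.out.println(s.length());                     // 3 code units, although only 2 characters
        System.out.println(s.charAt(1));                    // a lone high surrogate, not a character
        // reverse() is documented to keep surrogate pairs intact:
        System.out.println(new StringBuilder(s).reverse()); // "𝒜a", still well-formed
    }
}
```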

C#: I didn't find a correct reverse method; Length and indexed access return wrong results.

Perl: Same problem.

PHP: Has no notion of Unicode at all; mbstring provides some better-working replacements.


I wonder if there is a programming language which has full and correct Unicode support. What compromises had to be made to achieve such a thing?

  • More complex algorithms?
  • Higher memory consumption?
  • Slower performance?

How was it implemented internally?

  • Arrays of ints, linked lists, etc.?
  • Additional buffering?

I saw that Python 3 had some pretty big changes in this area. How close is Python 3 now to a correct implementation?

soc
  • The Java examples are correct because it is documented that all operations work on code units. – Philipp Jul 24 '10 at 13:56
  • I also find it hard to understand why you think these implementations aren't "correct". The corner cases are all documented and the frameworks all have special-purpose classes or methods to handle those. It sounds like asking for a language where every possible corner case is automatically handled by a primitive type and primitive operations? – Aaronaught Jul 24 '10 at 14:16
  • @Aaronaught: No, basically exactly the opposite. I'm wondering if there is a language where string operations are abstracted *away* from implementation details like their representation in memory; basically a language which returns the correct result for methods like length instead of insisting that people keep the Unicode manual and the platform-specific implementation in their head. – soc Jul 24 '10 at 14:32
  • Java’s `length` function returns the number of code units in the string, and it is correct because its behavior is documented and the implementation complies with the documentation. *Length* doesn't mean *number of code points*. – Philipp Jul 24 '10 at 14:43
  • @Philipp: I really don't want to argue with you, but do you really believe even 0.5% of developers are aware of that? Of course the documentation was adapted after UTF-16 was introduced, but that doesn't necessarily change what people expect. The current situation is a compromise between backward compatibility and the necessity to support UTF. I think if the Java developers had had UTF-16 on their radar when they designed Java, the implementation would look different today (with a "correct" result for length). – soc Jul 24 '10 at 15:46
  • @soc: No documentation or implementation can prevent mistakes made by programmers. I think the Java documentation is fairly explicit about this potential issue (it is mentioned right at the top of the `String` class reference). Saying that `length`'s result is incorrect because it counts code units is similar to saying that C's addition operator is incorrect because it uses modulo arithmetic—both are surprising to newcomers, but both are well-documented and programmers just have to know about them. In 2010, you definitely have to know that UTF-16 is a variable-width encoding, no excuse... – Philipp Jul 24 '10 at 15:58
  • ...Even if Java and the others were using UTF-32 exclusively, you still would have to deal with UTF-16 because it is used by Windows, Cocoa, OpenType... – Philipp Jul 24 '10 at 15:59
  • This is wrong. Perl does not present separate code units to the programmer the way these dumb UTF-16 languages do. Indexing is always by code point, never by code unit. And @Philipp, it is misleading and stupid that Java uses `length` to mean code units, not code points. It is a design bug. You cannot excuse it this way: it is still stoooopid. – tchrist Aug 22 '11 at 20:53

8 Answers

10

The Java implementation is correct in the sense that it does not violate the Unicode standard; there is no prescription that string indexing work on code points instead of code units, and the behavior is documented. The Unicode standard gives implementors great freedom concerning optimizations, as long as no invalid string is leaked. Concerning “full support”, that is even harder to define. The Unicode standard generally doesn’t require that certain features be implemented to be Unicode-compatible; it requires only that the features that are implemented are implemented according to the standard. Huge parts of script processing belong to fonts or the operating system, which programming systems cannot control. If you want to judge the Unicode support of certain technologies, you can start by testing the following (subjective and non-exhaustive) list of topics; a few of them are probed in the Java sketch after the list:

  • Does the system have a string datatype that uses a Unicode encoding?
  • Are all Unicode (UTF) encodings described in the standard supported?
  • Normalization
  • The Bidirectional Algorithm
  • Is UpperCase("ß") = "SS"?
  • Is upper-casing locale sensitive? (e.g. in Turkish, UpperCase("i") = "İ")
  • Are there functions to work with code points instead of code units?
  • Unicode regular expressions
  • Does the system raise exceptions when invalid code unit sequences are encountered during decoding?
  • Access to Unicode Database properties?
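
As a rough illustration, here is a small Java sketch probing a few of the items above. It is only a sketch of how one might test these points; the expected outputs in the comments reflect the documented standard-library behavior:

```java
import java.text.Normalizer;
import java.util.Locale;

public class UnicodeChecklist {
    public static void main(String[] args) {
        // Full case mapping: upper-casing can change the length
        System.out.println("ß".toUpperCase(Locale.ROOT));      // SS

        // Locale-sensitive casing: Turkish dotted capital I
        System.out.println("i".toUpperCase(new Locale("tr"))); // İ (U+0130)

        // Normalization via java.text.Normalizer
        String decomposed = "a\u0308"; // 'a' + COMBINING DIAERESIS
        System.out.println(Normalizer.normalize(decomposed, Normalizer.Form.NFC)
                .equals("\u00E4"));                            // true

        // Code point functions alongside the code-unit API
        String s = "\uD835\uDC9C"; // U+1D49C, outside the BMP
        System.out.println(s.length());                        // 2 code units
        System.out.println(s.codePointCount(0, s.length()));   // 1 code point
    }
}
```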

I think the Java and .NET answer to these questions is mostly “yes”, while the Python 3.x answer is almost always “no.”

Philipp
  • I just tried Python after testing Java/Scala/C# and IMHO at least Python doesn't fail at the most basic operations like reversing a string or accessing a character by index. – soc Jul 24 '10 at 14:17
  • Again, there is no failure. All programming languages index by code units because indexing by code points is not an `O(1)` operation unless you use UTF-32. Python just happens to use UTF-32 if some compile-time constant is used (which is the case on many Unix-like systems). – Philipp Jul 24 '10 at 14:31
  • @Philipp: Thanks! That comment was very helpful! That's exactly the thing I was interested in. Basically Python chose to have higher memory usage by using arrays of ints instead of arrays of chars to represent texts in memory? – soc Jul 24 '10 at 14:35
  • @soc: Usually I avoid terms such as "char" or "int" because they are not well-defined and misleading (the C datatype `char` is nothing but some integer of unspecified width and signedness, not anything Unicode would call a "character"). Any string is a sequence of code units. A UTF-*x* (where *x* is one of 8, 16 or 32) code unit is a number which can be represented by a fixed-size integer of width *x* (in bits). Implementations usually use a 16-bit unsigned integer for UTF-16 code units, and a 32-bit signed or unsigned integer for UTF-32 code units. – Philipp Jul 24 '10 at 14:49
  • Using UTF-32 always wastes some space because Unicode defines just 110000h code points, while a 32-bit integer can store up to 100000000h different values. But I've never seen a program run out of virtual memory because of using too many UTF-32 strings. – Philipp Jul 24 '10 at 14:51
  • @Philipp: That's a good point. So basically there is a choice to 1) implement strings with a fixed-length encoding like UTF-32 and have constant-time indexed access/easy counting, or 2a) implement strings with a variable-length encoding and have inconvenient results for length/indexed access, or 2b) O(n) length/indexed access, or 2c) add some buffering structure to strings which remembers how many characters were in, e.g., the first 512 bytes to amortize length/indexed access time. – soc Jul 24 '10 at 15:58
  • @soc: Yes. I don't know of any implementation that uses 2b or 2c, but you can easily build an *iterator* that iterates over code points even if you use a UTF-8 or UTF-16 string (iterating will then be O(*n*) for both code unit and code point iteration). An example of this is ICU's `CharacterIterator` (ICU uses UTF-16 strings). Often iteration is more important than indexed access. – Philipp Jul 24 '10 at 16:05
  • **Perl does all of those things.** See [my Unicode talks](http://training.perl.com/OSCON2011/index.html). – tchrist Aug 22 '11 at 20:55
  • @Philipp: Contrary to your assertion, Java and .NET **do not** give access to the properties from the Unicode Character Database. Check out their character classes or their regex syntax. You have access to little more than General Category and perhaps, if you're lucky, Script. [Here are the UCD properties](http://unicode.org/reports/tr44/#Property_Index), virtually none of which you can get at from Java or .NET, and all of which you can trivially get at from Perl. Also, .NET does **not** do [Unicode regexes](http://unicode.org/reports/tr18/); Perl does. Please do not misrepresent things. – tchrist Aug 23 '11 at 12:47
8

It looks like Perl 6 gets good Unicode support:

perlgeek.de/en/article/5-to-6#post_17

For instance it provides you with three different length methods:

  • bytes (number of bytes)
  • codes (number of code points)
  • graphs (number of graphemes)

This gets integrated into Perl's regular expressions as well.

Looks like a step in the right direction to me.
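
For comparison, the same three counts can be approximated in Java as well (a hedged sketch: java.text.BreakIterator counts legacy grapheme clusters, which predate Unicode's extended grapheme clusters, so edge cases may differ from Perl 6's graphs):

```java
import java.nio.charset.StandardCharsets;
import java.text.BreakIterator;

public class ThreeLengths {
    public static void main(String[] args) {
        // 'e' + COMBINING ACUTE ACCENT, then U+1D49C (outside the BMP)
        String s = "e\u0301\uD835\uDC9C";

        // bytes: length of the UTF-8 encoding
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 7

        // codes: number of code points
        System.out.println(s.codePointCount(0, s.length()));           // 3

        // graphs: number of grapheme clusters
        BreakIterator bi = BreakIterator.getCharacterInstance();
        bi.setText(s);
        int graphs = 0;
        while (bi.next() != BreakIterator.DONE) graphs++;
        System.out.println(graphs);                                    // 2
    }
}
```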

soc
  • Yes, for Perl this is a step in the right direction, but I have a strong feeling that this might not be the right direction for you. Targeting an unreleased version of a [dying language](http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html) that many people call a write-only language? Python 3 has very good Unicode support, is stable, and has more funding behind it. – sorin Jul 31 '10 at 07:34
  • Although I don't really like the "wordiness" of the Perl language, at least Perl is not going back in time like Python. After seeing how the whole GIL debate repeated itself with TCO, I completely lost my belief in any sensible leadership of Python. It is shameful to see how many people in the Python community react to these problems by calling others "brainwashed", denying even the existence of a problem, or calling the current situation the best, despite glaring evidence from the real world as well as from academia. – soc Jul 31 '10 at 23:16
  • Perl 5 has the best Unicode support of any extant language. The Perl 6 spec calls for grapheme support, so you can index by grapheme. In Perl 5, you have to use the `Unicode::GCString` class for that. – tchrist Aug 22 '11 at 20:58
7

Go, the new language developed at Google and invented by Ken Thompson and Rob Pike, and the C dialect in Plan 9 from Bell Labs were both built with Unicode in mind (UTF-8 was invented there, at Bell Labs, by Ken Thompson).

  • Is there an online interpreter to test it? My distribution doesn't have any Go packages yet :-/ – soc Jul 24 '10 at 14:22
  • Rob Pike also had a hand in UTF-8, and it is an [interesting story](http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt). – Chris May 07 '11 at 17:00
  • UTF-8 is not Unicode. To my knowledge, Go does not have multiple-code-point folding in the standard library, nor does it have the notion of extended grapheme clusters. – user7610 Aug 01 '13 at 21:06
5

In Python 3, strings are always Unicode (there is bytes for ASCII or similar encodings). I'm not aware of any built-ins not working correctly with them. There may be some, but considering it has been out for quite a while, I figure they got about everything needed daily working.

Of course Unicode has higher memory consumption (UTF-8 not really if you stay within the ASCII range, but otherwise...), and I can imagine variable-length encodings are a pain to handle internally. I don't know anything about the implementation, though, except that it can't be a linked list, since strings have O(1) random access.

  • Do you know what Python does internally? Do they store text as arrays of integers or longs? Or do they just have more complex algorithms and use something more naive in the background? Any idea how indexed access works? – soc Jul 24 '10 at 14:20
  • Python is no different than most other languages. Strings are implemented either as UTF-16 or UTF-32 strings using simple flat arrays, as determined at compile time, and always work on code units (everything else would be too inefficient). My Python 3.1 still gives `len("") = 2`. – Philipp Jul 24 '10 at 14:28
  • @Philipp Thats weird, my python 3.1 handles len's correctly, `>>> len("水")` 1 `>>> len("А")` 1 `>>> len("")` 1 – Daniel Kluev Jul 24 '10 at 14:35
  • As I've said, Python can be compiled to use UTF-16 *or* UTF-32. If it is compiled to use UTF-16 (this is the default on Windows and when you compile from source as I did), then `len("") = 2` because Python's `str` datatype indexes by code units, not code points. – Philipp Jul 24 '10 at 14:45
  • Interestingly Python's Unicode HOWTO (http://docs.python.org/py3k/howto/unicode.html#encodings) is fairly oblivious of Python's Unicode implementation itself—“Generally people don’t use this encoding [UTF-32]…” and “There’s also a UTF-16 encoding, but it’s less frequently used than UTF-8.”, while Python itself uses UTF-32 or UTF-16 exclusively for internal processing. – Philipp Jul 24 '10 at 16:10
  • You have to use Python 3, and you have to use a UCS-4 build, before you can start to have a usable Python system under Unicode. You cannot use Python 2, and you cannot use a narrow build. You should abort your program if those two requirements are unmet. – tchrist Aug 22 '11 at 21:00
5

Though this is a 10-year-old question,...

Yes. Swift does.

  • The basic string type String performs all character handling at the Unicode "grapheme cluster" level. Therefore you are forced to perform every text-mutating operation in a "Unicode-correct" manner at the "human-perceived character" level.

  • The String type is an abstract data type and does not expose its internal representation, but it has interfaces to access Unicode scalar values and code units for all of the UTF-8, UTF-16, and UTF-32 encodings.

  • It also stores breadcrumbs to provide offset conversion between UTF-8 and UTF-16 in amortized O(1) time.

  • The Character type also provides decomposition into Unicode scalar values.

  • The Character type has multiple character classification methods that are based on Unicode semantics. For example, Character.isNewline returns true for all newline strings defined in the Unicode standard, including LF, VT, FF, CR, CR-LF, and NEL.

  • Though it's abstracted, Swift 5.x internally stores strings in UTF-8 encoded form by default. It's possible to access them in strict O(1) time, so you can use UTF-8-based functions without sacrificing performance.

  • "Unicode" in Swift covers "all" characters defined in Unicode standard and not limited to BMP.

  • String, Character, and all of their derived view types like UTF8View, UTF16View, and UnicodeScalarView conform to the BidirectionalCollection protocol, so you can iterate components bidirectionally at all supported segmentation levels. They all share the same index type, so indices obtained from one view can be used on another view if they point at correct grapheme cluster boundaries.

eonil
1

The .NET Framework stores char and string data using the UTF-16 encoding. If you assume that all your text lies within the Basic Multilingual Plane, then everything will just work without any special code.

If you regard user-entered strings as blobs and don't try to manipulate them (e.g. most text fields in CRUD apps), then your code will appear to handle characters outside the BMP correctly, because UTF-16 stores them as surrogate pairs. As long as you don't fiddle with the surrogate pairs, all will be fine.

However, if you want to analyse and manipulate strings while also handling characters outside the BMP correctly, then you have to explicitly code for that possibility. See the StringInfo class for methods to help you process surrogate pairs.
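
Java's String has the same UTF-16 layout, so for comparison, the analogous defensive pattern looks like this (a sketch in Java, not .NET; StringInfo plays a similar role on the .NET side):

```java
public class SurrogateAware {
    public static void main(String[] args) {
        String s = "a\uD835\uDC9Cb"; // 'a', U+1D49C (a surrogate pair), 'b'
        // Walk the string by code point instead of by char/code unit
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            System.out.printf("U+%04X%n", cp);
            i += Character.charCount(cp); // advance 1 or 2 code units
        }
    }
}
```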

I would guess that Microsoft designed it this way to achieve a balance between performance and correctness. The alternatives would be:

  • Store strings as UTF-32 - poor performance in terms of memory use
  • Make all string functions handle surrogate pairs - very poor performance for manipulation

.NET also contains full support for culture-aware case conversion, comparisons and sorting.

Christian Hayter
  • Assuming that code that works on the BMP only is somehow tolerable or nonbuggy is absolutely horrible. Having an API that indexes, substrings, and gives the length only by code unit instead of by code point is a trainwreck. Tell me, in the regex engine, how many code units does `.` match — or is that how many code points? Even better: how well does the pattern `[𝒜-𝒵]` match the string `""`? Does it? That had better not expand to the UCS-2 `[\uD835\uDC9C-\uD835\uDCB5]`, or else you’re out of spec. Pretending that it’s ok to treat code units as code points absolutely sucks horrible rocks. – tchrist Aug 23 '11 at 12:32
  • @Christian: I'm sensitive to all this after lately doing a bunch of work in Java 6 w/o ICU, plus some in Python 2 under narrow builds. This mess stems from too quickly adopting Unicode v1, where characters were a fixed 16 bits, but then **forever neglecting to update their APIs & character models** to cope with Unicode v2⁺ from 1996–2011, whose characters are now 21 bits wide. UTF‑16 can be workable internally, but **letting serialization violate the abstraction layer is a major flaw & serious burden.** That’s why Go & Perl are superior to Java, C♯, Javascript, & (most) Python for Unicode work. – tchrist Aug 23 '11 at 14:22
0

I believe that any language supported on the .NET Framework has correct Unicode (UTF-16) support.

Also, similar question here

Peter Kelly
  • .NET has the same issue as Java. From http://msdn.microsoft.com/en-us/library/system.string.length.aspx: The Length property returns the number of Char objects in this instance, not the number of Unicode characters. The reason is that a Unicode character might be represented by more than one Char. Use the System.Globalization.StringInfo class to work with each Unicode character instead of each Char. – Simon Nickerson Jul 24 '10 at 14:03
  • Thanks for the tip - I didn't know that. But does the fact that there is an alternative, System.Globalization.StringInfo, available mean that it is in fact implemented correctly, albeit not in the Length property? – Peter Kelly Jul 24 '10 at 14:18
  • There is no issue, and the `Length` property is implemented correctly. Indexing works on UTF-16 code units, and there is nothing wrong with that. Knowing how many code points a string contains is of minor interest—`ä` is one code point while `ä` is two, do you see the difference? These two strings are canonically equivalent, yet their length (measured in code points) is different. What the user sees are graphemes or grapheme clusters, which can consist of more than one code point. – Philipp Jul 24 '10 at 14:36
  • @Philipp It's of minor interest because most of us make programs for Western languages mostly. Consider something as common as validation - maybe you need to validate that a username is at least 5 "characters" long. People mostly just do `if (userName.Length < 5) return false;`, which is not really what you want; you want the number of code points. It just happens to work because most languages use no more than one UTF-16 code unit per character. – nos Jul 24 '10 at 15:30
  • @nos: Well then I use `ẹ̷̇‌‌‌`, which is at least five characters (three of which are zero-width non-joiners), or even simply five ZWNJs. If you want to do user name validation, you have to do a lot more than count code units (e.g., count characters of category L). – Philipp Jul 24 '10 at 15:53
  • @nos: So validation code that just uses `Length` to check for "at least x characters long..." is actually poor practice with regard to globalization? @Philipp: What do you mean by counting category L characters? Do you mean counting literal characters? – Peter Kelly Jul 24 '10 at 16:07
  • "Correct UTF-16 support" is very different from allowing code point access. See the ICU libraries for the right way to do this. UTF-16 is horrendous, but it's even worse without a proper code point interface. Dot Net is screwed up. – tchrist Aug 22 '11 at 20:59
0

DigitalMars D has the datatype dstring, which uses UTF-32 code points; that should be enough for most cases.

Target-san
  • Using a specific transformation format is maybe 3% of the Unicode standard. What about the rest? – soc Aug 23 '11 at 09:35