How well does your language support unicode in practice?

Question

I'm looking into new languages for a new project, craving one where I no longer need to worry about charset problems, among the inordinate number of other niggles I have with PHP.

I tend to find Java too verbose and messy, and my not wanting to touch Windows with a 6-foot pole tends to rule out .Net. That leaves essentially everything else -- except PHP, C and C++ (the latter two of which I know get messy with unicode stuff irrespective of the ICU library).

I've shortlisted a few languages to date, namely Ruby (loved the mixins), Python, Lisp and Javascript (node.js). However, I'm coming across highly inconsistent information on unicode support, and I'm dreading (lack of time...) having to learn each and every one of them to the point where I can safely break it in order to rule it out.

As far as I understand, Python 3 seems to have it. As does Ruby 1.9. Lisp not necessarily. Javascript presumably.

There's arguably more than unicode support to a language, but in my experience it tends to become a major drawback when dealing with locale.

I also realize the question is somewhat subjective. (Please don't close it on those grounds: I'm actually linking to several SO threads which I found unsatisfying.) But... as a user of any of these languages, how well do they support unicode in practice?

Totally disagree with you regarding Java. Node.js is Unicode-supporting not as a specific implementation but because javascript as a language is (it's part of the ECMAScript standard) so it should be smooth there. I feel that you may not be picking the right tool for the job if you're putting such emphasis on string support... — davin, Jun 09 '11 at 00:51
@Davin: I've this thing with java... I probably was traumatized by how boring it could be to code in it at university. And yes, I agree that node.js is a good candidate in spite of it and its libraries being quite new. But I'm still keen to investigate the other options, because node.js (possibly wrongly) didn't strike me as being settled/mature enough. — Denis de Bernardy, Jun 09 '11 at 01:08
Don't worry about it. The only people who enjoy java are those suffering from stockholm syndrome. — ryeguy, Jun 09 '11 at 01:20
@ryeguy: As extremely scary as it sounds, I can actually relate to that somehow -- with PHP it's much the same, and I sure as hell know why I haven't touched a line of PHP in over a year. — Denis de Bernardy, Jun 09 '11 at 01:24

score 7 · Accepted Answer · answered Jun 09 '11 at 00:53

Python's unicode support did not really change in 3.x. It has been pretty much the same since Python 2.x, which introduced the separate unicode type and the encoding handling. What Python 3.x changes is that unicode becomes the only string type (and is renamed to str), whereas 2.x has bytestrings (str, "...") and unicode strings (unicode, u"...") that often, but not always, fail to mix cleanly. (Allowing them to mix was an attempt to make transitioning from bytestrings to unicode easier, but it turned out to be a mistake.) All in all, Python's unicode support is quite good, mistakes in Python 2.x notwithstanding. There are unicode literals with numeric and named escapes, source-encoding declarations for non-ASCII characters in unicode literals, automatic encoding/decoding through the codecs module, unicode support in many libraries (like the regular expression and DB-API modules) and a builtin unicode database.
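As a rough illustration of two of the features listed above, here is a minimal Python 3 sketch of numeric and named escapes and of the built-in Unicode database (the standard `unicodedata` module). The helper name `describe` is made up for this example.

```python
import unicodedata

# A numeric escape and a named escape for the same character.
sigma_numeric = "\u03c3"
sigma_named = "\N{GREEK SMALL LETTER SIGMA}"


def describe(text):
    """Return (code point, character name) pairs looked up in the Unicode database."""
    return [
        ("U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


print(sigma_numeric == sigma_named)
for code_point, name in describe("\u00e9\u65e5" + sigma_named):
    print(code_point, name)
```

In Python 2 the same literals would need the `u` prefix to produce unicode strings.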

That said, you still need to know about encodings in order to handle text correctly. Your program will receive bytes in some encoding (be it from files, from environment variables or through other input) and they will need to be interpreted in that encoding. If you don't know the encoding (and can't determine it from the data, like in HTML or XML) you can really only process the data as bytes. If you do know the encoding, Python does allow you to deal with it mostly transparently.
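A small Python 3 sketch of that point, assuming the input is known to be Latin-1: the bytes only become text once they are decoded with the right encoding, and a wrong guess should fail loudly rather than silently produce garbage. The helper `decode_text` and the sample bytes are illustrative only.

```python
def decode_text(raw, encoding):
    """Decode bytes strictly, so that a wrong encoding guess fails loudly."""
    return raw.decode(encoding, errors="strict")


# Pretend these bytes arrived from a file or socket known to be Latin-1.
raw = b"Ma\xdfe"

text = decode_text(raw, "latin-1")
print(text, len(text))        # four characters once decoded
print(text.encode("utf-8"))   # a different byte sequence for the same text

try:
    decode_text(raw, "utf-8")  # wrong assumption about the input
except UnicodeDecodeError as exc:
    print("not valid UTF-8:", exc.reason)
```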

It's true that Unicode *support* didn't really change that much between 2.6 and 3.0, but what matters is whether people *use* the feature. Trivial as it was, that little `u` prefix was a barrier preventing people from using Unicode strings, so it had to go. — dan04, Jun 09 '11 at 02:03
The `u''` really isn't the problem. It's the implicit conversion when mixing bytestrings and unicode that is the big thing. — Thomas Wouters, Jun 09 '11 at 10:07
Thanks... Do you know if there is a means to configure it to use an explicit locale/set of collation rules (so that string comparisons respect the latter). In Postgres, for instance, I can write: `SELECT col1 < col2 COLLATE "de_DE" FROM test;`. — Denis de Bernardy, Jun 09 '11 at 11:02
The unicode type itself compares by unicode codepoint number. If you want different (specific) collation rules, the usual thing to do is to use PyICU, which wraps IBM's icu library. — Thomas Wouters, Jun 09 '11 at 11:49
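To make the code point ordering concrete, here is a hedged Python 3 sketch. `rough_key` is a hypothetical helper that only case-folds and strips combining marks; it is not the Unicode Collation Algorithm and knows nothing about locales, so it is no substitute for an ICU-based collator such as the one PyICU wraps.

```python
import unicodedata

words = ["zebra", "Zulu", "\u00e4pfel", "apple"]  # "\u00e4pfel" is "äpfel"

# Plain sorting compares code point by code point:
# "Z" (U+005A) < "a" (U+0061) < "z" (U+007A) < "ä" (U+00E4).
by_code_point = sorted(words)


def rough_key(word):
    """Crude, locale-unaware sort key: case-fold, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", word.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(by_code_point)
print(sorted(words, key=rough_key))
```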

score 6 · Answer 2 · answered Jun 09 '11 at 00:53

Perl has excellent support for unicode. You need to know how to use it properly, but I have never found any language that has better unicode support than Perl, especially now with Perl 5.14.


clt60


Does it really? I was expecting it did, but I beg to ask, because [this answer on SO](http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default/6163129#6163129) kind of sent chills through my spine and pretty much everywhere else... :-( – Denis de Bernardy Jun 09 '11 at 00:54
Read this http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default carefully (really carefully). When you use perl/unicode, you must follow a few rules. When you follow them, you will get extremely competent and full unicode support, with character names, conversions, collation, NFC/NFD, etc. IMHO it is not comparable with any other language... – clt60 Jun 09 '11 at 01:06
Yeah, that's the post I linked to and this is my whole point. I read it quite in depth before deciding not to mention it as an option in my question. I seriously considered it, but the accepted answer made it sound like it was a heck of a lot of hassle (as compared to Postgres, for instance, which "just works"). I may have misunderstood, though... – Denis de Bernardy Jun 09 '11 at 01:12
In the mentioned answer, tchrist pointed out many problems that come from unicode itself. Many languages do not deal with them. Perl does, but there is a price: you need to know how to use it properly. Some languages simply do not document these things, so the user thinks the support is OK and later discovers the truth. :) For example, how do they deal with NFD filename conversions? Simply try asking the specific questions from tchrist's answer about other languages, and you will see for yourself. ;) – clt60 Jun 09 '11 at 01:19
Fair enough. Voting you up. :-) – Denis de Bernardy Jun 09 '11 at 01:22
:) "Just works"? Meaning for Roman script? What do you get when you convert a lowercase sigma into uppercase and back (case folding)? There *are* problems here, but most software does not talk about them. Consider that uc("σ") and uc("ς") are both "Σ", but lc("Σ") cannot possibly return both of those (uc converts to uppercase, lc converts to lowercase), as tchrist pointed out. ;) – clt60 Jun 09 '11 at 01:26
`postgres=# select lower('Σ'); lower ------- σ ` But I'll grant you this (irreversible) one: `postgres=# select lower(upper('ς')); lower ------- σ` – Denis de Bernardy Jun 09 '11 at 01:27
In fairness, I must say that there are modules on CPAN that are simply too old and do not support unicode correctly. But that is not an issue of the language itself. ;) – clt60 Jun 09 '11 at 01:30
;) So, as you see, you get one of the two possibilities. :) (And not Σ -> ς.) BTW, `perl -CSDA -Mutf8 -e 'print lc("Σ");'` gave "σ" too. :) – clt60 Jun 09 '11 at 01:32
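For comparison, the same sigma round trip can be tried in Python 3 (a sketch assuming Python 3.3 or later; the constant and function names are made up for the example). It shows the same loss of the final sigma that the Postgres and Perl one-liners above show.

```python
SMALL_SIGMA = "\u03c3"    # σ
FINAL_SIGMA = "\u03c2"    # ς
CAPITAL_SIGMA = "\u03a3"  # Σ


def survives_round_trip(ch):
    """True if upper-casing and then lower-casing gives back the same string."""
    return ch.upper().lower() == ch


# Both lowercase sigmas map to the same capital letter.
print(SMALL_SIGMA.upper() == FINAL_SIGMA.upper() == CAPITAL_SIGMA)
print(survives_round_trip(SMALL_SIGMA))
print(survives_round_trip(FINAL_SIGMA))
# casefold() is meant for caseless comparison and maps both sigmas together.
print(SMALL_SIGMA.casefold() == FINAL_SIGMA.casefold())
```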
Yeah, but it's actually documented in the ICU library (amongst other points) that some transliterations/transformations are irreversible, e.g. `latin f -> greek phi -> latin ph`, or whatever their example was when I read it. So your specific example is (as we say in France) pulled by the hair. ;-) Regardless, I'll take your word for it that I should take a closer look at perl too (which I barely know beyond copy/pasting a few lines here and there). – Denis de Bernardy Jun 09 '11 at 01:36
Welcome to Perl :) Once you are caught, you will never leave. :) :) – clt60 Jun 09 '11 at 01:43
My experience with it is that Perl is the most immediate way (C and Perl-compatible regular expressions being very close runners-up) to obfuscate code beyond the point where it's readable. As the joke goes: "If I had a hard time coding it, they should have a hard time decoding it." :-D – Denis de Bernardy Jun 09 '11 at 01:53
By the way... Is there a means to configure it to use an explicit locale/set of collation rules (so that string comparisons respect the latter). In Postgres, for instance, I can write: `SELECT col1 < col2 COLLATE "de_DE" FROM test;`. – Denis de Bernardy Jun 09 '11 at 11:02
See http://perldoc.perl.org/Unicode/Collate.html This module is an implementation of Unicode Technical Standard #10 (a.k.a. UTS #10), the Unicode Collation Algorithm (a.k.a. UCA). See also http://search.cpan.org/~sadahiro/Unicode-Collate-0.76-withoutworldwriteables/Collate/Locale.pm for locale-extended collation. – clt60 Jun 09 '11 at 11:08

score 3 · Answer 3 · answered Jun 09 '11 at 02:47


Racket (in the Lisp/Scheme camp) has good Unicode support. Racket distinguishes character strings (written "abc") from byte strings (written #"abc"). Character strings consist of Unicode characters and have all the Unicode-aware string operations one would expect (comparison, case folding, etc). By default Racket uses UTF-8 for character string I/O (including the encoding of source files), but it also supports conversion to and from other encodings. The GUI toolkit works with Unicode. So do regular expressions.


Ryan Culpepper


Thanks... Do you know if there is a means to configure it to use an explicit locale/set of collation rules (so that string comparisons respect the latter). In Postgres, for instance, I can write: `SELECT col1 < col2 COLLATE "de_DE" FROM test;`. – Denis de Bernardy Jun 09 '11 at 11:01
@Denis, yes, there are locale-sensitive comparison functions and a way to set the locale. See http://docs.racket-lang.org/reference/encodings.html#%28tech._locale%29 – Ryan Culpepper Jun 09 '11 at 19:04

score 2 · Answer 4 · answered Jun 09 '11 at 00:53

From my personal experience, Ruby 1.9.2 handles unicode internally pretty well, except for some strange areas like the upcase/downcase/capitalize methods of the String class. I have to override them for all my Rails applications.


Jake Jones


"I have to override them for all my Rails applications." -- Doesn't that mean that, on the contrary, it's a lot poorer than what [this series of posts](http://blog.grayproductions.net/articles/understanding_m17n) would suggest? – Denis de Bernardy Jun 09 '11 at 01:03
@Denis, if you don't think ruby/python/java/PHP is right for you and you are not gonna even try that, you should find another interest other than programming. – user482594 Jun 09 '11 at 01:18
@user: believe me, I've tried quite a few languages (in fact, all of the ones you mention). Enough to know that there's a difference between trying/copying/pasting a few lines of code here and there, and managing to break it by knowing it inside and out. I wouldn't have asked the question if not... – Denis de Bernardy Jun 09 '11 at 01:20
Not sure about "a lot"; it's just 3 methods after all. I have no idea why they still haven't added unicode support for them. – Jake Jones Jun 09 '11 at 01:54
Jörg W Mittag [explained](http://stackoverflow.com/questions/3724913/capitalize-first-letter-in-ruby/3726650#3726650) this. – steenslag Jun 09 '11 at 10:38
+1. Steenslag chimed in with his own answer on ruby, explaining the three exceptions you mentioned. – Denis de Bernardy Jun 09 '11 at 10:39
@daekrist: ICU and iconv do a rather good job at it. For instance, in Postgres: `select upper('jörg'); -- JÖRG`. – Denis de Bernardy Jun 09 '11 at 10:55
@Denis: What does `select lower('MASSE');` return? – Jörg W Mittag Jun 10 '11 at 16:16
That'll return masse because my cluster is initialized as english. Had it been initialized in German instead, I'm 100% sure that the [lc_ctype](http://www.postgresql.org/docs/9.0/interactive/locale.html) would kick in when upper-casing Maße, and I'd expect it to lowercase it back to masse (some [ICU transformations](http://site.icu-project.org/) are one-way only, and the dictionary won't kick in on every call to lowercase). – Denis de Bernardy Jun 10 '11 at 16:32
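As a point of comparison only (it says nothing about how Postgres or ICU behave), Python 3's built-in string methods use locale-independent Unicode case mappings. A minimal sketch, assuming Python 3.3 or later, with a hypothetical helper `caseless_equal`:

```python
def caseless_equal(left, right):
    """Compare two strings without regard to case, using Unicode case folding."""
    return left.casefold() == right.casefold()


word = "Ma\u00dfe"  # "Maße"

upper = word.upper()
print(upper, len(word), len(upper))   # upper-casing changes the length
print(upper.lower())                  # the sharp s does not come back
print(caseless_equal(word, "MASSE"))
```

Upper-casing expands the sharp s to two letters, so lower-casing the result cannot restore the original spelling; `casefold()` is the tool intended for comparisons like this.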
@Jorg (or other): on a separate note, have either of you given Rubymine a try? I'm currently on Textmate, and was looking into VIM a bit, for Ruby development... (I hear Rubymine is slow.) – Denis de Bernardy Jun 10 '11 at 16:33

score 2 · Answer 5 · answered Jun 09 '11 at 02:44


Lisps have strong support for unicode. All modern popular lisps (SBCL, Clozure CL, clisp) use UTF-32/UCS-4 for strings and support UTF-8 as an external format.


dmitry_vk


Thanks... Do you know if there is a means to configure it to use an explicit locale/set of collation rules (so that string comparisons respect the latter). In Postgres, for instance, I can write: `SELECT col1 < col2 COLLATE "de_DE" FROM test;`. – Denis de Bernardy Jun 09 '11 at 11:01

score 1 · Answer 6 · edited May 23 '17 at 11:59

Ruby examples:

    # encoding: UTF-8
    puts RUBY_VERSION  # => 1.9.2

    def Σ(arr)
      arr.inject(:+)
    end

    Π = Math::PI
    str = "abc日本def"

    puts Σ [4,6,8,3]  # => 21
    puts Π            # => 3.141592653589793
    puts str.scan(/\p{Han}+/)  # => 日本
    p Encoding.name_list # not just utf8
    #["ASCII-8BIT", "UTF-8", "US-ASCII", "Big5", "Big5-HKSCS", "Big5-UAO", "CP949", "Emacs-Mule", "EUC-JP", "EUC-KR", "EUC-TW", "GB18030", "GBK", "ISO-8859-1", "ISO-8859-2", "ISO-8859-3", "ISO-8859-4", "ISO-8859-5", "ISO-8859-6", "ISO-8859-7", "ISO-8859-8", "ISO-8859-9", "ISO-8859-10", "ISO-8859-11", "ISO-8859-13", "ISO-8859-14", "ISO-8859-15", "ISO-8859-16", "KOI8-R", "KOI8-U", "Shift_JIS", "UTF-16BE", "UTF-16LE", "UTF-32BE", "UTF-32LE", "Windows-1251", "BINARY", "IBM437", "CP437", "IBM737", "CP737", "IBM775", "CP775", "CP850", "IBM850", "IBM852", "CP852", "IBM855", "CP855", "IBM857", "CP857", "IBM860", "CP860", "IBM861", "CP861", "IBM862", "CP862", "IBM863", "CP863", "IBM864", "CP864", "IBM865", "CP865", "IBM866", "CP866", "IBM869", "CP869", "Windows-1258", "CP1258", "GB1988", "macCentEuro", "macCroatian", "macCyrillic", "macGreek", "macIceland", "macRoman", "macRomania", "macThai", "macTurkish", "macUkraine", "CP950", "CP951", "stateless-ISO-2022-JP", "eucJP", "eucJP-ms", "euc-jp-ms", "CP51932", "eucKR", "eucTW", "GB2312", "EUC-CN", "eucCN", "GB12345", "CP936", "ISO-2022-JP", "ISO2022-JP", "ISO-2022-JP-2", "ISO2022-JP2", "CP50220", "CP50221", "ISO8859-1", "Windows-1252", "CP1252", "ISO8859-2", "Windows-1250", "CP1250", "ISO8859-3", "ISO8859-4", "ISO8859-5", "ISO8859-6", "Windows-1256", "CP1256", "ISO8859-7", "Windows-1253", "CP1253", "ISO8859-8", "Windows-1255", "CP1255", "ISO8859-9", "Windows-1254", "CP1254", "ISO8859-10", "ISO8859-11", "TIS-620", "Windows-874", "CP874", "ISO8859-13", "Windows-1257", "CP1257", "ISO8859-14", "ISO8859-15", "ISO8859-16", "CP878", "SJIS", "Windows-31J", "CP932", "csWindows31J", "MacJapanese", "MacJapan", "ASCII", "ANSI_X3.4-1968", "646", "UTF-7", "CP65000", "CP65001", "UTF8-MAC", "UTF-8-MAC", "UTF-8-HFS", "UCS-2BE", "UCS-4BE", "UCS-4LE", "CP1251", "UTF8-DoCoMo", "SJIS-DoCoMo", "UTF8-KDDI", "SJIS-KDDI", "ISO-2022-JP-KDDI", "stateless-ISO-2022-JP-KDDI", "UTF8-SoftBank", "SJIS-SoftBank", "locale", "external", "filesystem", "internal"]

Indeed, capitalization is not supported for non-ASCII characters in Ruby 1.9, and for a reason.

OK... So if I get this right, the likes of str.len(), str.sub(), and friends are all multibyte aware; and it's just locale-aware functionality such as upper()/lower() that needs attention? And if so, is there any means to let ruby know which locale/transliteration rules to use at the app level? — Denis de Bernardy, Jun 09 '11 at 10:38
Case replacement (upcase, swapcase, capitalize etc) is only effective in the ASCII region, so "jörg".upcase is "JöRG". I'm not aware of anything language-specific beyond that. — steenslag, Jun 09 '11 at 10:49
Right. But is there a means to configure it to use an explicit locale/collation rules somehow (so that string comparisons respect collation rules, for instance). In Postgres, for instance, I can write: `SELECT col1 < col2 COLLATE "de_DE" FROM test;`. — Denis de Bernardy, Jun 09 '11 at 11:00
Since Ruby 2.4 [String#downcase etc takes an optional parameter](https://ruby-doc.org/core-2.4.0/String.html#method-i-downcase), the default being full unicode case mapping. — steenslag, Aug 20 '17 at 22:28
