19

Which widely used programming languages were designed ground-up with Unicode support?

A lot of programming languages have added Unicode support as an afterthought in later versions, but which widely used languages were released with Unicode support from day one?

imz -- Ivan Zakharyaschev
knorv
  • 8
    Of course, many of the most popular and successful programming languages predate the introduction of Unicode. – Ned Deily Sep 12 '09 at 21:49
  • 4
    And ( apart from python 3 ) most predate the extension of unicode above 16 bits - both Java and the .net languages have UCS2 support with methods for handling surrogates tacked on later – Pete Kirkham Sep 12 '09 at 22:32
  • Just curiously, why do you ask? – Roman Sep 18 '09 at 13:56
  • Which version of Unicode are you talking about? Plenty of languages were designed with Unicode 1.0 support, but few with Unicode 4.0 support "from day one". Which specific Unicode features are you interested in? – Daniel Pryden Sep 18 '09 at 21:49
  • 2
    Roman: Pure language history interest. – knorv Sep 20 '09 at 13:07
  • Could Haskell's syntax be defined differently if it was made up with Unicode and multilingualism in mind?.. I was intrigued to learn from [a smart answer](http://stackoverflow.com/questions/5517412/where-is-it-specified-whether-unicode-identifiers-should-be-allowed-in-a-haskell/5517639#5517639) that although Haskell allows Unicode ids it's a problem to program in Haskell with all ids in your favorite human language if the language is written uni-case (no upper vs. lower case distinction in Unicode for it) because Haskell supposes that one uses different cases in ids for different purposes! – imz -- Ivan Zakharyaschev Apr 04 '11 at 00:23

12 Answers

33

Java was probably the first popular language to have ground-up Unicode support.

Ken Keenan
  • 5
    Apart from the fact that it "only" supports the Basic Multilingual Plane (which was all that Unicode had when Java was invented). The .NET framework is the first language I know which is designed around "full" unicode support (including correct Length for strings that contain surrogates...) – mihi Sep 12 '09 at 21:49
  • 7
    Java supports the full Unicode standard since forever, not just the BMP. Strings are stored in UTF-16 (not UCS-2, which would mean BMP only). – Joachim Sauer Sep 12 '09 at 22:09
  • 6
    When Java was designed, Unicode was only the BMP. According to MSDN's documentation on String in .Net - "The Length property returns the number of Char objects in this instance, not the number of Unicode characters. ". The java.lang.String.codePointCount() method returns the number of code points in the string taking account for surrogates. – Pete Kirkham Sep 12 '09 at 22:29
  • 1
    @ Joachim Sauer: UCS-2 support the full Unicode standard (don't forget the surrogate pairs D800 to DBFF). Java was designed to use UTF-16 as was the .Net framework, but Java was designed before UTF-32/UCS-4 while .NET was designed after, but both languages have access to the full range of code points. – Martin York Sep 21 '09 at 03:46
  • 2
    @JoachimSauer: And yet there's *still* no built-in functionality to *iterate over the characters of a `String`*. – Mechanical snail Aug 19 '12 at 06:44
31

Basically all of the .NET languages are Unicode languages, such as C# and VB.NET.

Jay Bazuzi
  • 1
    Really? High five Microsoft! Any idea if IronRuby, IronPython and F# are in the same boat? – George Mauer Sep 12 '09 at 21:48
  • 7
    George, all .NET languages that use the System.String class have full Unicode support. I don't know of any .NET languages that don't use the System.String class, so that means IronRuby, IronPython and especially F# (which is a first class language starting with VS2010) have native Unicode support. I can't think of a good reason why someone would create a .NET language and make a special non-Unicode string class for it when a Unicode string class is already provided in the BCL. – Allon Guralnek Sep 17 '09 at 08:14
  • 1
    Strictly speaking, a System.String is composed of UTF-16 encoded characters, not Unicode 5 abstract code points (graphemes). If your app cares about the difference (most won't need to), then you can use the System.Globalization.StringInfo class. – Christian Hayter Sep 18 '09 at 20:10
  • 1
    Can you make a CLS compliant language without System.String support? – Chris S Sep 21 '09 at 14:48
20

There were many breaking changes in Python 3, among them the switch to Unicode for all text.

So Python wasn't designed ground-up for Unicode, but Python 3 was.
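As a rough illustration of what "Unicode for all text" means in Python 3, the sketch below shows that `str` is a sequence of code points, that turning it into `bytes` is an explicit encoding step, and that identifiers may be non-ASCII. The function names `describe` and `sum_below` are made up for this example.

```python
def describe(text: str) -> dict:
    """Return the code point count and UTF-8 byte count of a str."""
    encoded = text.encode("utf-8")  # str -> bytes is always an explicit step
    return {"code_points": len(text), "utf8_bytes": len(encoded)}


def sum_below(limit: int) -> int:
    """Sum 0..limit-1 using non-ASCII identifiers, which Python 3 accepts."""
    Σ = 0
    for λ in range(limit):
        Σ += λ
    return Σ


if __name__ == "__main__":
    print(describe("שלום"))  # four Hebrew letters, two UTF-8 bytes each
    print(sum_below(20))
```

The same source would be a syntax error in Python 2, where identifiers had to be ASCII.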

Sam
Mark Rushakoff
  • Unicode support was added to Python in 2000. So whilst not quite "ground up" it's pretty early. http://www.python.org/dev/peps/pep-0100/ (edit: actually that document was turned into a PEP in 2000, the Unicode support probably predates this) – fuzzyman Feb 01 '12 at 15:18
  • 1
    I'm surprised by the amount of upvotes this answer has. Python 3 is just a major release, not a new programming language. – yeyo Jan 29 '15 at 23:16
12

I don't know how far this goes in other languages, but a fun thing about C# is that not only is the runtime (the string class etc.) Unicode-aware, but Unicode is also fully supported in source code:

```csharp
using משליט = System.Object;
using תוצאה = System.Int32;
public class שלום : משליט  {
    public תוצאה בית() {
        int אלף = 0;
        for (int λ = 0; λ < 20; λ++) אלף+=λ;
        return אלף;
    }
}
```
Marc Gravell
  • 2
    (note that there is possibly some odd right-to-left issue in the above in browser/editor; if you paste it into VS, it is "int {name} = 0") – Marc Gravell Sep 19 '09 at 10:02
  • 2
    @gw: Try running `"πθ√".Select(c=>CharUnicodeInfo.GetUnicodeCategory(c))` in LINQPad and you'll see why ;-) – Eamon Nerbonne Sep 21 '09 at 11:06
  • 4
    Same in Perl 5 and Perl 6. Perl 6 has even Unicode operators. – Alexandr Ciornii Dec 06 '09 at 19:43
  • Python has supported an explicit encoding for source code files for a long time. It's only in Python 3 that you can have Unicode identifiers. Unicode operators is a terrible idea... – fuzzyman Feb 01 '12 at 15:21
10

Google's Go programming language supports Unicode and works with UTF-8.

Rohit
7

It really is difficult to design future-proof Unicode support into a programming language right from the beginning.

Java is one of the languages that had this designed into the language specification. However, Unicode support in v1.0 of Java is different from v5 and v6 of the Java SDK. This is primarily due to the version of Unicode that the language specification catered to when the language was originally designed. Java attempts to track changes in the Unicode standard with every major release.

Early implementations of the JLS could claim Unicode support primarily because Unicode itself then defined at most 65,536 characters (Java 1.0 supported Unicode 1.1, and Java 1.4 supported Unicode 3.0), which fit the 16-bit storage used for each character. That changed with Unicode 3.1. It's an evolving standard, usually with more characters added in each release. The characters added in 3.1 beyond the 16-bit range are called supplementary characters. Support for supplementary characters was added in Java 5 via JSR-204; Java 5 and 6 support Unicode 4.0.
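To see why supplementary characters broke the "one 16-bit unit per character" assumption, here is a small sketch written in Python rather than Java. It counts UTF-16 code units (what Java's `String.length()` reports) against code points (what `codePointCount` reports), and computes the surrogate pair by hand. The helper names are hypothetical.

```python
def utf16_code_units(text: str) -> int:
    """Number of 16-bit units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def surrogate_pair(code_point: int) -> tuple:
    """Split a supplementary code point into its high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000          # 20 bits remain after the offset
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


if __name__ == "__main__":
    g_clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP
    print(len(g_clef), utf16_code_units(g_clef))
    print([hex(unit) for unit in surrogate_pair(ord(g_clef))])
```

A character inside the Basic Multilingual Plane needs one unit, so the two counts only diverge once supplementary characters appear in the string.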

Therefore, don't be surprised if different programming languages implement Unicode support differently.

On the other hand, PHP(!!) and Ruby did not have Unicode support built in at their inception.

PS: Support for Unicode 5.1 is planned for Java 7.

Vineet Reynolds
5

Java and the .NET languages, as other commenters have pointed out, although Java's strings are UTF-16 rather than UCS or UTF-8. (At the time, it seemed like a sensible idea! Now clearly either UTF-8 or UCS would be better.) And Python 3 is really a different, incompatible language from Python 1.x and 2.x, so it qualifies too.

The Plan9 languages around 1992 were probably the first to do this: their dialect of C, rc, Alef, mk, ACID, and so on, were all Unicode-enabled. They took the very simple approach that anything that wasn't ASCII was an identifier character. See their paper from 1993 on the subject. (This is the project where UTF-8 was invented, which meant they could do this in a pretty compatible way, in particular without plumbing binary-versus-text through all their programs.)
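The "anything that wasn't ASCII is an identifier character" rule works because every byte of a multi-byte UTF-8 sequence is 0x80 or above, so it can never be mistaken for an ASCII operator or delimiter. The sketch below is not Plan 9 code, just a minimal illustration of that rule on raw bytes; `is_identifier_byte` and `split_identifiers` are invented names.

```python
from typing import List


def is_identifier_byte(b: int) -> bool:
    """ASCII letters, digits, underscore, or any non-ASCII byte (>= 0x80)."""
    return b >= 0x80 or chr(b).isalnum() or b == ord("_")


def split_identifiers(source: bytes) -> List[str]:
    """Collect runs of identifier bytes without ever decoding mid-scan."""
    tokens = []
    current = bytearray()
    for b in source:
        if is_identifier_byte(b):
            current.append(b)
        elif current:
            tokens.append(current.decode("utf-8"))
            current.clear()
    if current:
        tokens.append(current.decode("utf-8"))
    return tokens


if __name__ == "__main__":
    print(split_identifiers("int λ = אלף + x1;".encode("utf-8")))
```

The scanner only compares bytes against ASCII ranges, which is why a tool written for ASCII text could be made Unicode-enabled with so little plumbing. For brevity the sketch also accepts runs that start with a digit.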

Other languages that support non-ASCII identifiers include current PHP.

4

Perl 6 has complete unicode support from scratch.
(With the Rakudo Perl 6 compiler being the first implementation)

General overview

Unicode operators

Strings, regular expressions and grammars all operate on graphemes, even for those codepoint combinations for which there is no composed representation (an artificial composed codepoint is generated on the fly for those cases).
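The case described here, a combination with no precomposed form, can be shown with Unicode normalization. This is a Python sketch, not Perl 6, and Python's standard library does not segment graphemes; it only illustrates why a grapheme-based string type needs synthetic codepoints for some sequences. `nfc_length` is a hypothetical helper.

```python
import unicodedata


def nfc_length(text: str) -> int:
    """Code point count after composing as far as Unicode allows (NFC)."""
    return len(unicodedata.normalize("NFC", text))


if __name__ == "__main__":
    e_acute = "e\u0301"   # 'e' + COMBINING ACUTE ACCENT, composes to U+00E9
    q_tilde = "q\u0303"   # 'q' + COMBINING TILDE, has no precomposed form
    print(len(e_acute), nfc_length(e_acute))
    print(len(q_tilde), nfc_length(q_tilde))
```

Both inputs are one user-perceived character, but only the first collapses to a single code point under NFC.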

A special encoding, "utf8-c8", exists to handle data of unknown encoding: it assumes UTF-8 when possible, but creates artificial codepoints for undecodable sequences, allowing them to round-trip if necessary.
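Python has a comparable mechanism, the `surrogateescape` error handler, which can serve as a rough analogy for the round-tripping idea. It is not how Perl 6 implements utf8-c8, only a sketch of the same goal: invalid bytes are mapped to placeholder code points on decode and restored on encode. The two wrapper functions are named for this example only.

```python
def decode_lossless(raw: bytes) -> str:
    """Decode as UTF-8, mapping undecodable bytes to lone surrogates."""
    return raw.decode("utf-8", errors="surrogateescape")


def encode_lossless(text: str) -> bytes:
    """Reverse decode_lossless, restoring the original bytes exactly."""
    return text.encode("utf-8", errors="surrogateescape")


if __name__ == "__main__":
    raw = b"caf\xc3\xa9 \xff\xfe"      # valid UTF-8 followed by two stray bytes
    text = decode_lossless(raw)
    print(ascii(text))
    print(encode_lossless(text) == raw)
```

The valid part decodes normally, so ordinary string operations still work on it, while the stray bytes survive untouched.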

Community
Elizabeth Mattijsen
2

Python 3.x: http://docs.python.org/dev/3.0/whatsnew/3.0.html

janneb
1

A feature that was included in a language when it was first designed is not always the best.

Languages have changed over time and many have become bloated with extra features, while not necessarily keeping the features they first included up to date.

So I just throw out the idea that you shouldn't necessarily discount languages that have recently added Unicode. They will have the advantage of adding Unicode to an already mature development tool, and getting the chance to do it right the first time.

With that in mind, I want to ensure that Delphi is included here, as one of your answers. Embarcadero added Unicode in their Delphi 2009 version and did a mighty fine job on it. It was enough to finally prompt me to upgrade from the Delphi 4 that I had been using for 10 years.

lkessler
-1

Java uses characters from the Unicode character set.

sdu
  • 13
    Most programming languages use characters from the Unicode character set. ( they just place restrictions on which characters they use ) – Pete Kirkham Sep 12 '09 at 22:09
-1

java and .net languages

eglasius