
I am just starting to learn Ruby (to eventually move to RoR), but I was just told that Ruby does not support Unicode. Is that true? How do Ruby programmers go about supporting Unicode?

Regis Zaleman

5 Answers


What you heard is outdated and applies (only partially) to Ruby 1.8 and earlier. The latest stable version of Ruby (1.9) supports no fewer than 95 different character encodings (counted on my system just now), including pretty much every known Unicode Transformation Format, UTF-8 among them.

The previous stable version of Ruby (1.8) has partial support for UTF-8.

If you use Rails, it takes care of default UTF-8 encoding for you. If all you need is UTF-8 awareness, Rails will work for you whether you run Ruby 1.9 or Ruby 1.8. If you have very specific character encoding requirements, you should aim for Ruby 1.9.

If you're really interested, here is a series of articles describing the encoding issues in Ruby 1.8 and how they were worked around, and eventually solved in Ruby 1.9. Rails still includes workarounds for many common flaws in Ruby 1.8.
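For example, Ruby 1.9 lets you inspect the available encodings and each string's encoding tag directly. A minimal irb-style sketch (the sample string and the exact encoding count are illustrative, not part of the original answer):

puts Encoding.list.size           # number of known encodings; 95 on the system above
str = "héllo"                     # every literal carries an encoding tag
puts str.encoding                 # => UTF-8 (assuming a UTF-8 source file)
latin = str.encode("ISO-8859-1")  # transcode into another supported encoding
puts latin.encoding               # => ISO-8859-1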

molf
  • for anyone like myself seeking a shortcut to a $KCODE equivalent for a programmatic default encoding switch, what you want is: `Encoding.default_internal = 'utf-8'` # `Encoding.list.map(&:names)` – Travis Jun 29 '11 at 20:18

Adding the following line at the top of my file solved it.

# encoding: utf-8
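For context, in Ruby 1.9 that magic comment sets the source file's encoding, which otherwise defaults to US-ASCII; Ruby 2.0 and later already default to UTF-8, so the comment becomes unnecessary there. A minimal sketch of a file using it:

# encoding: utf-8
# The comment above must be the first line (or follow only a shebang line).
greeting = "olé"
puts greeting.encoding  # => UTF-8 under Ruby 1.9, thanks to the magic comment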
Kannaiyan

That's not true. What is true is that Ruby does not support only Unicode; it supports a whole slew of other encodings as well.

This is in contrast to systems such as Java, .NET, or Python, which follow the "One Encoding To Rule Them All" model. Ruby has what one of the designers of its m17n system calls a "CSI" (Code Set Independent) model, which means that instead of all strings sharing one and the same encoding, every string is tagged with its own encoding.

This has significant advantages for both ease of use and performance: if your input and output encodings are the same, you never need to transcode. With the One True Encoding model, by contrast, you need to transcode twice in the worst case, once from the input encoding into the internal encoding and again from the internal encoding into the output encoding (and that worst case unfortunately happens pretty often, because most of these environments chose an internal encoding that nobody actually uses). In Ruby, you need to transcode at most once.
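A minimal sketch of what that per-string tagging looks like in Ruby 1.9+ (the strings and encodings chosen here are illustrative):

utf8   = "résumé"                       # tagged UTF-8, given a UTF-8 source file
latin1 = utf8.encode("ISO-8859-1")      # a second string, tagged ISO-8859-1
sjis   = "カタカナ".encode("Shift_JIS")  # a third, tagged Shift_JIS

[utf8, latin1, sjis].each do |s|
  puts "#{s.encoding}: #{s.bytesize} bytes"
end

# When input and output encodings already match, nothing is converted;
# otherwise String#encode transcodes directly, at most once.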

The basic problem with the OTE model is that whatever encoding you choose as the One True Encoding, it will be a completely arbitrary choice, since there simply isn't a single encoding that everybody, or even a majority, uses.

In Java, for example, they chose UCS-2 as the One True Encoding. Then, a couple of years later, it turned out that UCS-2 was actually not enough to encode all characters, so they had to make a backwards-incompatible change to Java, to switch to UTF-16 as the One True Encoding. Except by that time, a significant portion of the world had moved on from UTF-16 to UTF-8. If Java had been invented a couple of years earlier, they would probably have chosen ASCII as the One True Encoding. If it had been invented in another country, it might be Shift-JIS. If it had been invented by another company, it might be EBCDIC. It's really completely arbitrary, and such an important choice shouldn't be.

Jörg W Mittag
  • @tchrist: It is an encoding in the sense that it assigns a unique *number* for every character (which is pretty much the dictionary definition of "encoding"). It isn't an encoding in the sense that it doesn't assign a unique *bit pattern* for every character (in Unicode lingo, that's the transfer format's job). Unfortunately, I've never been able to come up with a good name for what Unicode is, other than "encoding". – Jörg W Mittag Feb 06 '11 at 15:44
  • Jörg: [#1] A **character repertoire** is the complete set of purely abstract characters. [#2] A **coded character set** maps those abstract characters to non-negative integers called *code points* in a 1:1 relationship. [#3] A **character encoding function** (or form) defines a precise bitwise layout for serializing those integer code points. This might be easier understood by looking at a smaller repertoire than Unicode's. Radix-50 has a 50-character repertoire with 2 different coded character sets (pre- vs post-PDP-11) whose code points pack 3 at a time into a 16-bit word. *(continued…)* – tchrist Feb 06 '11 at 16:15
  • *(…continued)* Per def#1, Unicode is a repertoire that includes such abstract characters as LATIN CAPITAL LETTER AE WITH MACRON, GERMAN PENNY SIGN, and CIRCLED WZ. Per def#2, those 3 abstract characters are respectively assigned the code points 1E2₁₆, 20B0₁₆, and 1F12E₁₆. Per def#3, those integers serialize under UTF-8 to "\xC7\xA2", "\xE2\x82\xB0", & "\xF0\x9F\x84\xAE"; under UTF-16LE to "\xE2\x01", "\xB0\x20", & "\x3C\xD8\x2E\xDD"; under UTF-32BE to "\x00\x00\x01\xE2", "\x00\x00\x20\xB0", & "\x00\x01\xF1\x2E". Me, I use *encoding* for def#3 and *code point assignment* for def#2. Make sense? – tchrist Feb 06 '11 at 16:35
  • **[CORRECTION]** Radix-50 has a 50₈ character repertoire, which is 40₁₀. – tchrist Feb 06 '11 at 16:41
  • @Jörg: You forgot to mention Perl. Its model is cleaner than Java's, because it uses logical code points (def#2), not serialized ones (def#3) as Java and Python ill-advisedly do. But yes, everything normalizes to the Unicode repertoire (def#1). I have yet to see any reasonable demonstration of why you would want alien, non-Unicode-able code points, or why you would carry every string's original serialization along with it forever. I consider it a severe flaw that Ruby does so, not any sort of desirable feature. It also suggests a misunderstanding of the immense "private use" section of Unicode. – tchrist Feb 06 '11 at 17:44
  • "you need to transcode twice in the worst case" – Java seems to be doing okay compared to Ruby on the performance front :) And from a space point of view, doesn't tagging every string with its encoding balloon the memory usage a bit? – Rob Grant Aug 27 '14 at 14:54
  • @RobertGrant: I doubt those performance differences are due to encodings. Actually, I'm pretty sure that even the slowest Ruby implementations can compete favorably with the fastest Java implementations in text processing performance. For example, JRuby doesn't use Java's text processing at all, they duplicate all of that functionality on top of `byte[]` arrays, because Java's text processing performance is so slow. Joni, the Java port of Ruby's `Regexp` engine Onigmo, is sometimes significantly faster than Java regex engines, despite the fact that Joni is much more powerful. – Jörg W Mittag Aug 27 '14 at 22:05
  • @JörgWMittag yes agreed it's not because of encodings :) Was just poking the Ruby bear :) – Rob Grant Aug 28 '14 at 07:18

This is quite an old question. The current stable version of Ruby is 2.0.1. Yes, it handles most of the Unicode you can throw at it, but please be aware that it breaks fairly easily.

Take a look at this code sample and results (inspired by this):

["noël","","baffle"].each do |str|
  puts "Result for '#{str}'"
  puts "  Size: #{str.size}"
  puts "  Reverse: [#{str.reverse}]"
  puts "  Uppercase: [#{str.upcase}]"
end  

Result for 'noël'
  Size: 5 <= bad size
  Reverse: [l̈eon] <= accent is shifted
  Uppercase: [NOËL]
Result for '😸😾'
  Size: 2
  Reverse: [😾😸]
  Uppercase: [😸😾]
Result for 'baﬄe'
  Size: 4
  Reverse: [eﬄab] <= doesn't really make sense
  Uppercase: [BAﬄE] <= should be "BAFFLE"

The point is that modern Ruby handles the basics; more advanced string operations shouldn't be counted on.
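If you do need grapheme-aware behaviour, one hedged workaround is to operate on grapheme clusters rather than codepoints; String#grapheme_clusters requires Ruby 2.5 or later, and on older Rubies the ActiveSupport mb_chars approach mentioned in the comments below is an alternative:

str = "noe\u0308l"                       # "noël" in decomposed form (e + combining diaeresis)
puts str.size                            # => 5, counts codepoints
puts str.grapheme_clusters.size          # => 4, counts what a reader actually sees
puts str.grapheme_clusters.reverse.join  # => "lëon", the accent stays on the e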

GregPK
  • I didn't get your comment. Why doesn't `efflab` being the reverse of `baffle` make sense? Or why should the uppercase of `baffle` be `ELFFAB`? – eis Apr 11 '14 at 09:47
  • reverse of `baffle` should be `elffab`, not `efflab` :-) – kralyk May 25 '14 at 10:15
  • @kralyk @GregPK Looks like `baffle` is treated correctly, considering ﬄ is a single character. It does really make sense. :) – ray Oct 29 '14 at 02:57
  • Somebody should actually report this bug for 'noël'. As of now you can install `rails` with `gem`, require `active_support/core_ext/string` and use `str.mb_chars.reverse` instead. – wieczorek1990 Feb 18 '16 at 23:19
  • I do not know any use case where I needed to reverse a Unicode string. What is the use case you have for needing a working reverse in Unicode strings? – Eduardo Jun 12 '16 at 11:08
  • `['a', 'ą', 'b'].sort` also fails (returns `["a", "b", "ą"]` instead of `["a", "ą", "b"]` in Ruby 2.3.4) – reducing activity May 24 '17 at 04:41

In this answer to a different question, one person said they had trouble with Iconv when handling Unicode data in Ruby 1.9, but I can't vouch for its accuracy.

Andrew Grimm