
Timezones for date-times and encodings for strings are no problem as long as you do not have to convert between them. In Ruby 1.9 and 2.0, encodings seem to be the new timezones from older Ruby versions: they cause nothing but trouble. Iconv has been replaced by the native encoding functions. How do you convert from the standard UTF-8 to ISO-8859-1, for example for use on Windows systems? In the Ruby 2.0 console the encode function does not seem to work, although it should be able to convert from a source encoding to a destination encoding via encode(dst_encoding, src_encoding) → str:

>> "ABC äöüÄÖÜ".encoding
=> #<Encoding:UTF-8>
>> "ABC äöüÄÖÜ".encode("UTF-8").encode("ISO-8859-1")
=> "ABC \xE4\xF6\xFC\xC4\xD6\xDC"
>> "ABC äöüÄÖÜ".encode("ISO-8859-1","UTF-8")
=> "ABC \xE4\xF6\xFC\xC4\xD6\xDC"

I am using Ruby 2.0.0 (revision 41674) on a Linux system.

0x4a6f4672
  • What's the problem? `"ABC äöüÄÖÜ".encode("ISO-8859-1","UTF-8")` converts UTF-8 to Latin-1 and returns the Latin-1 string. `"ABC äöüÄÖÜ"` is already UTF-8, so the second argument to `encode` is irrelevant. What are you expecting `"ABC äöüÄÖÜ".encode("ISO-8859-1","UTF-8")` to do, and how is your expectation different from what does happen? – mu is too short Oct 09 '13 at 17:44
  • The string "ABC \xE4\xF6\xFC\xC4\xD6\xDC" does not look like a text with valid encoding to me, or does it? At least the special characters are not displayed correctly. The goal was to convert UTF-8 strings for an Excel import. So far it does not seem to work well. The idea was if I manage to convert UTF-8 to ISO-8859-1, then Excel will import and display the texts on Windows as well. – 0x4a6f4672 Oct 10 '13 at 08:15
  • That is a Latin-1 encoded string being displayed in a UTF-8 terminal. – mu is too short Oct 10 '13 at 17:02

1 Answer


The encode method does work.

Let's create a string with U+00FC (ü):

uuml_utf8 = "\u00FC"       #=> "ü"

Ruby encodes this string in UTF-8:

uuml_utf8.encoding         #=> #<Encoding:UTF-8>

In UTF-8, ü is represented as 195 188 (decimal):

uuml_utf8.bytes            #=> [195, 188]

Now let's convert the string to ISO-8859-1:

uuml_latin1 = uuml_utf8.encode("ISO-8859-1")

uuml_latin1.encoding       #=> #<Encoding:ISO-8859-1>

In ISO-8859-1, ü is represented as 252 (decimal):

uuml_latin1.bytes          #=> [252]

In UTF-8, however, the byte 252 (0xFC) is not a valid sequence. That's why your terminal/console displays the replacement character "�" (U+FFFD) or no character at all.

In order to display ISO-8859-1 encoded characters, you'll have to switch your terminal/console to that encoding, too.
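
If you want to check the conversion without relying on what the terminal shows, you can inspect the string from inside Ruby. The following is only a sketch: the variable names are made up for illustration, and the string is written with \u escapes so that it is UTF-8 regardless of the source file's encoding.

```ruby
# "ABC äöüÄÖÜ", written with \u escapes so the literal is always UTF-8.
utf8   = "ABC \u00E4\u00F6\u00FC\u00C4\u00D6\u00DC"
latin1 = utf8.encode("ISO-8859-1")

# The converted string is valid in its own encoding,
# and Latin-1 uses exactly one byte per character.
latin1.valid_encoding?            #=> true
latin1.bytesize == utf8.length    #=> true

# Converting back gives the original string, so nothing was lost.
round_trip = latin1.encode("UTF-8")
round_trip == utf8                #=> true

# This is what a UTF-8 terminal effectively does: it reads the
# Latin-1 bytes as if they were UTF-8, and they are not valid UTF-8.
misread = latin1.dup.force_encoding("UTF-8")
misread.valid_encoding?           #=> false
```

Note that `force_encoding` only relabels the bytes, while `encode` actually transcodes them.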

Stefan
  • Yes, but in your example uuml_latin1 has the value "\xFC" and not the special character "ü". 'print uuml_latin1' gives � , while 'puts uuml_latin1' produces an empty string. Something seems to be wrong, or are the Ruby functions not able to display ISO-8859-1 encodings? – 0x4a6f4672 Oct 10 '13 at 08:23
  • 0xFC is indeed the hex value for 252. This means Ruby 2.0 is not able to display strings with ISO-8859-1 encoding correctly, using the right characters? Why does it work with UTF-8 encoding, but not with ISO-8859-1 encoding? – 0x4a6f4672 Oct 10 '13 at 08:32
  • Ruby doesn't *display* the strings, your terminal does. Change it from UTF-8 to ISO-8859-1 and you'll see a `ü`. – Stefan Oct 10 '13 at 09:22
  • Ok, so the reason the encoding seems to be wrong is that the terminal/console/bash cannot display it, because it has the wrong locale/charset/character map/whatever. – 0x4a6f4672 Oct 10 '13 at 09:42
  • Exactly, `0xFC` is not a valid UTF-8 sequence. It's like opening an ISO-8859-1 file in a UTF-8 editor. – Stefan Oct 10 '13 at 09:46
  • Can you update the answer accordingly? Then I can accept it. I finally managed to generate and export the right encoding. The encode function does work, but a) the terminal cannot display ISO-8859-1 (Latin-1) or ISO-8859-15 (Latin-9) encodings because it uses UTF-8 by default, and b) it has to be used in the right places. For instance, if you send the data with send_data, you also have to call encode there: `send_data(csv_string.encode("ISO-8859-15"), :type => 'text/csv;charset=ISO-8859-15')` http://stackoverflow.com/questions/9639153/character-encoding-issue-exporting-rails-data-to-csv – 0x4a6f4672 Oct 10 '13 at 13:05
  • How do you know *In UTF-8 however 252 is an invalid sequence*? Asking out of curiosity. – Arup Rakshit Jan 30 '14 at 19:13
  • @ArupRakshit http://en.wikipedia.org/wiki/UTF-8#Codepage_layout 192-193 and 245-255 (the red cells) are invalid – Stefan Jan 30 '14 at 20:19
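
The point from the comments about calling encode where the data leaves the program can be tried without Rails. The sketch below is an assumption-laden stand-in: a temporary file takes the place of send_data, and the CSV content and variable names are invented for illustration.

```ruby
require "tempfile"

# Hypothetical CSV data: "Müller" and a euro sign, written with \u escapes.
csv_utf8 = "Name;Price\nM\u00FCller;10 \u20AC\n"

# ISO-8859-15 (Latin-9) contains the euro sign, so this conversion succeeds.
csv_latin9 = csv_utf8.encode("ISO-8859-15")

# Encode at the point of export: write the converted bytes unchanged.
file = Tempfile.new(["export", ".csv"])
file.binmode
file.write(csv_latin9)
file.close

# Read the raw bytes back: one byte per character, no UTF-8 sequences.
raw = File.binread(file.path)
raw.bytesize == csv_utf8.length   #=> true
file.unlink

# ISO-8859-1 has no euro sign, so a plain encode would raise
# Encoding::UndefinedConversionError. Replacing is one way to avoid that.
fallback = csv_utf8.encode("ISO-8859-1", undef: :replace, replace: "?")
```

Whether replacing unknown characters with "?" is acceptable depends on the data; choosing ISO-8859-15 or Windows-1252 instead may be the better option when the euro sign matters.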