
I have a queue of text messages in Redis. Let's say a message in Redis is something like this:

"niño" 

(spot the non-ASCII character).

The Rails app displays the queue of messages. When I test locally (Rails 3.2.2, Ruby 1.9.3) everything is fine, but on Heroku Cedar (Rails 3.2.2, and I believe Ruby 1.9.2) I get the infamous error: ActionView::Template::Error (invalid byte sequence in UTF-8)

After reading and rereading all I could find online I am still stuck as to how to fix this.

Any help or a pointer in the right direction is greatly appreciated!

edit:

I managed to find a solution. I ended up using Iconv:

string = Iconv.iconv('UTF-8', 'ISO-8859-1', message)[0]

None of the suggested answers I found elsewhere seemed to work in my case.

matt
klaut
  • i installed through heroku labs ruby 1.9.3, but i still get the same error :| – klaut Apr 06 '12 at 18:20
  • When requiring Iconv in Ruby 1.9.3 you get this warning: `iconv will be deprecated in the future, use String#encode instead.` The equivalent to your solution would be something like: `string.force_encoding('iso-8859-1').encode('utf-8')`. – matt Apr 06 '12 at 22:27
  • Or `string = message.encode('utf-8', 'iso-8859-1')` might be better. – matt Apr 06 '12 at 23:01
  • good point, thanks!.. the only thing that bugs me is that with my solution, now on my local machine (macosx) i see the converted ones as "niÃ±o", whereas the non-converted ones are correct: "niño". Still can't figure out why. – klaut Apr 07 '12 at 08:57
  • Ruby 1.9+'s default internal encoding is set to the same encoding that your computer is set to use. Heroku is using UTF-8. If your computer is set to something other than UTF-8 (i.e. ISO-8859-1) this would explain the difference in behavior. – coreyward Apr 07 '12 at 20:32
  • Are you using UTF-8 to transfer data over HTTP? If not, have you configured Rails to use whatever encoding you're using over HTTP? By default, Rails uses UTF-8. Also, if you're not using UTF-8 you will want to make sure that you configure service connections to use or convert data to the proper encoding (and ensure their libraries are compatible). – coreyward Apr 07 '12 at 20:34
  • Related: http://stackoverflow.com/questions/4697413/character-encoding-issue-in-rails-v3-ruby-1-9-2/4697471#4697471 – coreyward Apr 07 '12 at 20:36
  • Where do you put the `string = Iconv.iconv('UTF-8', 'ISO-8859-1', message)[0]` statement? – Muhammed Bhikha Dec 22 '12 at 15:51
  • i was encoding email messages that were sent to my application. so this line was used before i put the text into redis. I just checked the code again (as it has changed since my original question) and now i am doing it like this: - i first check what encoding the email is in `email_text_encoding = JSON.parse(params['charsets'])['text']` - i then use that encoding to convert it to utf8 `utf_ed = text.encode('utf-8', email_text_encoding)` – klaut Dec 25 '12 at 20:56

1 Answer


On Heroku, when your app receives the message "niño" from Redis, it is actually getting the four bytes:

 0x6e 0x69 0xf1 0x6f

which, when interpreted as ISO-8859-1, correspond to the characters n, i, ñ and o.

However, your Rails app assumes that these bytes should be interpreted as UTF-8, and at some point it tries to decode them this way. The third byte in this sequence, 0xf1, looks like this in binary:

1 1 1 1 0 0 0 1

If you compare this to the table on the Wikipedia page for UTF-8, you can see this byte is the leading byte of a four-byte character (it matches the pattern 11110xxx), and as such it should be followed by three continuation bytes that all match the pattern 10xxxxxx. It isn't: the next byte is 0x6f (01101111), so this is an invalid UTF-8 byte sequence and you get the error you see.
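The check described above can be reproduced in a few lines of Ruby. This is only a sketch: the variable names are made up for illustration, and the string is built by hand from the four bytes rather than read from Redis.

```ruby
# Build the four bytes Heroku receives and tag them as UTF-8,
# which is what the Rails app assumes them to be.
received = [0x6e, 0x69, 0xf1, 0x6f].pack('C*').force_encoding('UTF-8')

# Binary form of the third byte, 0xf1.
lead_byte_bits = received.bytes.to_a[2].to_s(2)
puts lead_byte_bits            # 11110001, i.e. the 11110xxx lead-byte pattern

# Ruby can report whether the bytes are well-formed in the tagged encoding.
valid = received.valid_encoding?
puts valid                     # false
```

`String#valid_encoding?` is a cheap way to spot this kind of data before it reaches a template.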

Using:

string = message.encode('utf-8', 'iso-8859-1')

(or the Iconv equivalent) tells Ruby to read message as ISO-8859-1 encoded, and then to create the equivalent string in UTF-8 encoding, which you can then use without problems. (An alternative could be to use force_encoding to tell Ruby the correct encoding of the string, but that will likely cause problems later when you try to mix UTF-8 and ISO-8859-1 strings).
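To make the difference between the two approaches concrete, here is a small sketch. The variable names are illustrative only, and the input is again built from the raw bytes instead of coming from Redis.

```ruby
# The ISO-8859-1 bytes for "niño", wrongly tagged as UTF-8.
message = [0x6e, 0x69, 0xf1, 0x6f].pack('C*').force_encoding('UTF-8')

# encode with an explicit source encoding transcodes the bytes.
converted = message.encode('UTF-8', 'ISO-8859-1')
puts converted.bytes.to_a.inspect   # [110, 105, 195, 177, 111]
puts converted.valid_encoding?      # true

# force_encoding only changes the label; the bytes stay the same.
relabelled = message.dup.force_encoding('ISO-8859-1')
puts relabelled.bytes.to_a.inspect  # [110, 105, 241, 111]

# Mixing the relabelled string with non-ASCII UTF-8 text then fails.
mix_error = begin
  relabelled + "\u00f1"
  nil
rescue Encoding::CompatibilityError => e
  e
end
puts mix_error.class                # Encoding::CompatibilityError
```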

In UTF-8, the string "niño" corresponds to the bytes:

0x6e 0x69 0xc3 0xb1 0x6f

Note that the first, second and last bytes are the same. The ñ character is encoded as the two bytes 0xc3 0xb1. If you write these out in binary and compare them to the table in the Wikipedia article again, you'll see they encode the code point 0xf1, which is also the ISO-8859-1 encoding of ñ (since the first 256 Unicode code points match ISO-8859-1).

If you take these five bytes and treat them as being ISO-8859-1, then they correspond to the string

niÃ±o

Looking at the ISO-8859-1 codepage, 0xc3 maps to Ã, and 0xb1 maps to ±.
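The same misreading can be reproduced directly, which also shows what happens when a string that is already UTF-8 is converted from ISO-8859-1 a second time. The names below are illustrative only.

```ruby
# "niño" as a proper UTF-8 string (written with an escape to avoid
# depending on the source file's encoding).
utf8 = "ni\u00f1o"
hex_bytes = utf8.bytes.to_a.map { |b| format('0x%02x', b) }
puts hex_bytes.join(' ')   # 0x6e 0x69 0xc3 0xb1 0x6f

# Treat those five bytes as ISO-8859-1 and transcode to UTF-8,
# which is what converting already-correct data amounts to.
misread = utf8.dup.force_encoding('ISO-8859-1').encode('UTF-8')
puts misread               # niÃ±o
puts misread.length        # 5
```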

So what's happening on your local machine is that your app is receiving the five bytes 0x6e 0x69 0xc3 0xb1 0x6f from Redis, which is the UTF-8 representation of "niño". On Heroku it's receiving the four bytes 0x6e 0x69 0xf1 0x6f, which is the ISO-8859-1 representation.

The real fix to your problem will be to make sure the strings being put into Redis are all already UTF-8 (or at least all the same encoding). I haven't used Redis, but from what I can tell from a brief Google, it doesn't concern itself with string encodings but simply gives back whatever bytes it's been given. You should look at whatever process is putting the data into Redis, and ensure that it handles the encoding properly.
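One way to do that is to normalise every message to UTF-8 at the point where it enters the system, before it is written to Redis. The following is a minimal sketch, not a definitive implementation: `to_utf8` is a hypothetical helper, and it assumes the sender's encoding is known (for example from the charset information that comes with an incoming email).

```ruby
# Hypothetical helper: convert text from its declared encoding to UTF-8.
# Raises ArgumentError if the bytes are not valid in the declared encoding.
def to_utf8(text, declared_encoding)
  # Label the bytes with the encoding the sender declared, without changing them.
  labelled = text.dup.force_encoding(declared_encoding)
  unless labelled.valid_encoding?
    raise ArgumentError, "bytes are not valid #{declared_encoding}"
  end
  # Transcode to UTF-8 (a no-op when the text is already UTF-8).
  labelled.encode('UTF-8')
end

# The ISO-8859-1 bytes for "niño", as they might arrive from a sender.
incoming = [0x6e, 0x69, 0xf1, 0x6f].pack('C*')

normalised = to_utf8(incoming, 'ISO-8859-1')
puts normalised.encoding   # UTF-8
puts normalised.bytesize   # 5

# Only the normalised string would then be pushed onto the Redis queue.
```

Converting once at the boundary means the code that reads the queue never needs to guess an encoding, and data that is already UTF-8 is not converted a second time.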

matt