
Please consider the following code:

__ENCODING__
# => #<Encoding:UTF-8>

Encoding.default_internal
# => #<Encoding:UTF-8> 

Encoding.default_external
# =>  #<Encoding:UTF-8> 

Case 1: HAML throws Encoding::UndefinedConversionError

string = "j\xC3\xBCrgen".force_encoding('ASCII-8BIT')

string.encoding
# =>  #<Encoding:ASCII-8BIT>  

Haml::Engine.new("#{string}").render
# => Encoding::UndefinedConversionError: "\xC3" from ASCII-8BIT to UTF-8

ERB.new("<%= string %>").result(binding)
# => "jürgen"
# => Resulting encoding is #<Encoding:UTF-8> 

Erubis::Eruby.new("<%= string %>").result(binding)       
# => "j\xC3\xBCrgen"
# => resulting encoding is #<Encoding:ASCII-8BIT>

Case 2: HAML doesn't throw error

string = "Ratatouille".force_encoding('ASCII-8BIT')

string.encoding
# => #<Encoding:ASCII-8BIT>

Haml::Engine.new("#{string}").render
# => "Ratatouille\n"
# => resulting encoding is #<Encoding:UTF-8>

ERB.new("<%= string %>").result(binding)
# => "Ratatouille"
# => resulting encoding is #<Encoding:UTF-8> 

Erubis::Eruby.new("<%= string %>").result(binding)
# => "Ratatouille" 
# => result encoding is #<Encoding:US-ASCII> 

Question: Why does HAML fail in case 1 and succeed in case 2?

Why I'm asking: I'm facing a similar problem where rendering in HAML blows up the page with Encoding::CompatibilityError.

The only workaround I know of right now is to call .force_encoding('UTF-8') on my string. That avoids the issue, but I have to do it on every page where I use the given string, e.g. "j\xC3\xBCrgen" (which is tedious, considering how many pages there are).

Any clues?

Uri Agassi
Ratatouille

2 Answers


From the PickAxe book:

Ruby supports a virtual encoding called ASCII-8BIT. Despite the ASCII in the name, this is really intended to be used on data streams that contain binary data (which is why it has an alias of BINARY). However, you can also use this as an encoding for source files. If you do, Ruby interprets all characters with codes below 128 as regular ASCII and all other characters as valid constituents of variable names. This is basically a neat hack, because it allows you to compile a file written in an encoding you don’t know: the characters with the high-order bit set will be assumed to be printable.

String#force_encoding tells Ruby which encoding to use in order to interpret some binary data. It does not change/convert the actual bytes (that would be String#encode), just changes the encoding associated with these bytes.
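To make the difference concrete, here is a minimal sketch using only core Ruby. The helper names `relabel_as_binary` and `transcode_to_latin1` are made up for illustration; ISO-8859-1 is just a convenient target to show bytes changing.

```ruby
# Relabel only: the bytes stay the same, only the encoding tag changes.
# (Hypothetical helper name, not part of any library.)
def relabel_as_binary(str)
  str.dup.force_encoding(Encoding::ASCII_8BIT)
end

# Transcode: the bytes are rewritten for the target encoding.
# (Hypothetical helper name, not part of any library.)
def transcode_to_latin1(str)
  str.encode(Encoding::ISO_8859_1)
end

utf8   = "j\u00FCrgen"               # "jürgen" as a UTF-8 string
binary = relabel_as_binary(utf8)
latin1 = transcode_to_latin1(utf8)

p utf8.bytes    # => [106, 195, 188, 114, 103, 101, 110]
p binary.bytes  # => [106, 195, 188, 114, 103, 101, 110] (unchanged)
p latin1.bytes  # => [106, 252, 114, 103, 101, 110]      (converted)
```

The relabelled string still holds the two UTF-8 bytes for "ü"; Ruby has simply been told not to interpret them as UTF-8.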

Why would you try to associate a BINARY encoding to a string containing UTF-8 characters anyway?

Regarding your question about why the second case succeeds, the answer is simply that your second string ("Ratatouille") contains only 7-bit ASCII characters.

Kostas Rousis

Haml is trying to encode the result string to your Encoding.default_internal setting. In the first example the string ("j\xC3\xBCrgen") contains non-ASCII bytes (i.e. bytes with the high bit set), whilst the string in the second example ("Ratatouille") doesn’t. Ruby can encode the second string (since UTF-8 is a superset of ASCII), but it can’t encode the first, because ASCII-8BIT defines no mapping for bytes above 127, so it raises an error.
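The same behaviour can be reproduced without Haml by calling String#encode directly. This is only a sketch of the conversion described above, not Haml's actual code path, and `encode_to_utf8_or_nil` is a hypothetical helper name.

```ruby
# Hypothetical helper: try to transcode to UTF-8, returning nil when Ruby
# has no mapping for one of the source bytes.
def encode_to_utf8_or_nil(str)
  str.encode(Encoding::UTF_8)
rescue Encoding::UndefinedConversionError
  nil
end

ascii_only = "Ratatouille".b      # String#b returns an ASCII-8BIT copy
high_bytes = "j\xC3\xBCrgen".b

p ascii_only.ascii_only?              # => true
p high_bytes.ascii_only?              # => false
p encode_to_utf8_or_nil(ascii_only)   # => "Ratatouille"
p encode_to_utf8_or_nil(high_bytes)   # => nil
```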

One way to work around this is to explicitly pass the string encoding as an option to Haml::Engine:

Haml::Engine.new("#{string}", :encoding => Encoding::ASCII_8BIT).render

This will give you a result string that is also ASCII-8BIT.

In this case the string in question is UTF-8 though, so a better solution might be to look at where the string is coming from in your app and ensure it has the right encoding.
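As an illustration of fixing the encoding at the source, the sketch below assumes the string comes from a file; your data may come from a database, socket, or HTTP client instead, where the equivalent setting will differ. The helper name `read_raw_and_text` is made up for this example.

```ruby
require "tempfile"

# Hypothetical helper: read the same file twice, once as raw bytes and once
# with the external encoding declared up front.
def read_raw_and_text(path)
  raw  = File.binread(path)                   # tagged ASCII-8BIT
  text = File.read(path, encoding: "UTF-8")   # tagged UTF-8 at read time
  [raw, text]
end

Tempfile.create("name") do |file|
  file.binmode
  file.write("j\xC3\xBCrgen".b)
  file.flush

  raw, text = read_raw_and_text(file.path)
  p raw.encoding == Encoding::ASCII_8BIT   # => true
  p text.encoding == Encoding::UTF_8       # => true
  p text.valid_encoding?                   # => true
end
```

Declaring the encoding once where the data enters the app means templates never see a binary-tagged string, so no per-page force_encoding is needed.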

I don’t know enough about ERB and Erubis to say what’s happening. It looks like ERB is incorrectly assuming the string is UTF-8 (it has no way to know those bytes should actually be treated as UTF-8), and Erubis is doing the more sensible thing of leaving the encoding as binary, either because it isn’t doing any encoding at all or because it treats binary-encoded input specially.

matt
  • It might be argued that raising an error here is a bug in Haml. I’m not sure how the other users of `default_internal` handle this, it might be better to leave binary files unchanged. – matt Jul 01 '14 at 13:31
  • Yes, I also felt that raising an error in HAML is unreasonable, since the default internal encoding is `UTF-8` and I am assuming that "j\xC3\xBCrgen" is a valid UTF-8 string – Ratatouille Jul 03 '14 at 08:06
  • So, to get rid of the error, I should always use `force_encoding('UTF-8')`. Is that the correct takeaway from the above answer? – Ratatouille Jul 03 '14 at 08:08
  • @Ratatouille "j\xC3\xBCrgen" is a valid UTF-8 string, but there’s no way of knowing if it is supposed to be UTF-8 from the string alone. You should avoid using `force_encoding` generally – the best way to fix this is to ensure your strings are properly encoded whenever you read them in from an external source. If you can’t do that for some reason and you know the string is valid UTF-8 you could resort to `force_encoding`. – matt Jul 03 '14 at 14:48
  • Thanks a ton. Sorry for replying a bit late; accepting the answer. – Ratatouille Jul 08 '14 at 08:01
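For the fallback mentioned in the comments (using force_encoding only when you know the bytes are valid UTF-8), a guarded version might look like the sketch below. The helper name `utf8_from_external` and the choice to raise ArgumentError are assumptions for illustration, not an established convention.

```ruby
# Hypothetical helper: tag incoming bytes as UTF-8 once, and verify that
# they really are valid UTF-8 instead of trusting the label blindly.
def utf8_from_external(bytes)
  str = bytes.dup.force_encoding(Encoding::UTF_8)  # dup leaves the caller's string untouched
  raise ArgumentError, "input is not valid UTF-8" unless str.valid_encoding?
  str
end

name = utf8_from_external("j\xC3\xBCrgen".b)
p name.encoding == Encoding::UTF_8   # => true
p name == "j\u00FCrgen"              # => true
```

Calling a helper like this once, at the point where the string enters the app, avoids repeating force_encoding in every template.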