Determine character encoding in Ruby 1.9.3

Question

My Rails 3.2.2 / Ruby 1.9.3 application gets search requests such as:

http://booko.com.au/books/search?q=Fran%E7ois+Vergniolle+de+Chantal

Ruby / Rails decodes this query but assumes the result is UTF-8. At some point I get:

invalid byte sequence in UTF-8
app/models/product.rb:694:in `upcase'

I think it's doing something like this:

q="Fran%E7ois+Vergniolle+de+Chantal"
=> "Fran%E7ois+Vergniolle+de+Chantal"

CGI.unescape( q )
=> "Fran\xE7ois Vergniolle de Chantal"

CGI.unescape( q ).encoding.name
=> "UTF-8"

CGI.unescape( q ).valid_encoding?
=> false

What is the correct way of dealing with this? I'd like to transcode it to the correct encoding, but how do I determine the current encoding? What I'm currently doing is just assuming it's Latin-1 (ISO-8859-1):

q.encode!("ISO-8859-1", "UTF-8", :invalid => :replace, :undef => :replace, :replace => "")

Or doing something I found on a blog somewhere:

q = q.unpack('C*').pack('U*')

What's the right way of dealing with this?

Edit: The server correctly sends a "Content-Type: text/html; charset=utf-8" header to the client. The page also contains the appropriate meta tag: 'meta http-equiv="content-type" content="text/html;charset=UTF-8"'

Is there another way to tell the client which encoding to use?

What if you write `# coding: UTF-8` at the top of `app/models/product.rb`? I think that should solve the error. Would you be satisfied with this solution? — ck3g, Mar 21 '12 at 06:50
You would have to use some kind of dictionary in order to determine the correct encoding, as the same byte `0xE7` could be (and indeed is) a valid character in encodings other than Latin1. — Mladen Jablanović, Mar 21 '12 at 08:15
@ck3g The data is coming from a web request so that won't help. The app already thinks it's UTF-8 when it isn't. — dkam, Mar 21 '12 at 09:17
@MladenJablanović Yes - that would be a solution. Does such a dictionary exist? As 0xE7 exists in multiple encodings, you'd want to sort by most common I guess - unless there were multiple characters to narrow down the choice. — dkam, Mar 21 '12 at 09:17
You should assume UTF-8, see http://stackoverflow.com/questions/912811/what-is-the-proper-way-to-url-encode-unicode-characters — Christoffer Hammarström, Mar 22 '12 at 11:22

score 5 · Accepted Answer · edited Oct 07 '21 at 05:54

The character ç is encoded in the URL as %E7. This is how ISO-8859-1 encodes ç. The ISO-8859-1 character set represents a character with a single byte. The byte which represents ç can be expressed in hex as E7.

In Unicode, ç has a code point of U+00E7. Unlike ISO-8859-1, in which the code point (E7) is the same as its encoding (E7 in hex), Unicode has multiple encoding schemes such as UTF-8, UTF-16 and UTF-32. UTF-8 encodes U+00E7 (ç) as two bytes: C3 A7.

See here for other ways to encode ç.

As to why U+00E7 and E7 in ISO-8859-1 both use "E7", the first 256 code points in Unicode were made identical to ISO-8859-1.
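To make the byte-level difference concrete, here is a small sketch that prints how the same character is represented in each encoding. The variable names are illustrative only.

```ruby
# encoding: UTF-8

c = "\u00E7"  # the character ç, held as a UTF-8 string

# Bytes of the UTF-8 representation, formatted as hex
utf8_bytes = c.bytes.map { |b| "%02X" % b }

# Bytes after transcoding the same character to ISO-8859-1
latin1_bytes = c.encode("ISO-8859-1").bytes.map { |b| "%02X" % b }

# The Unicode code point, independent of any byte encoding
code_point = "U+%04X" % c.ord

puts "UTF-8 bytes:      #{utf8_bytes.join(' ')}"    # C3 A7
puts "ISO-8859-1 bytes: #{latin1_bytes.join(' ')}"  # E7
puts "Code point:       #{code_point}"              # U+00E7
```

The single ISO-8859-1 byte matches the low byte of the code point, which is why a lone E7 is valid Latin-1 but not valid UTF-8.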

If this URL were UTF-8, ç would be encoded as %C3%A7. My (very limited) understanding of RFC 2616 is that the default encoding for a URL is (currently) ISO-8859-1. Therefore, this is most likely an ISO-8859-1 encoded URL, which means the best approach is probably to check that the encoding is valid and, if not, assume it is ISO-8859-1 and transcode it to UTF-8:

unless query.valid_encoding?
    query.encode!("UTF-8", "ISO-8859-1", :invalid => :replace, :undef => :replace, :replace => "")
end

Here's the process in IRB (plus re-escaping at the end for fun):

a = CGI.unescape("%E7")
=> "\xE7"
a.encoding
=> #<Encoding:UTF-8>
a.valid_encoding?
=> false
b = a.encode("UTF-8", "ISO-8859-1")    # From ISO-8859-1 -> UTF-8
=> "ç"
b.encoding
=> #<Encoding:UTF-8>
CGI.escape(b)
=> "%C3%A7"
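The steps above could be wrapped in a small helper. This is only a sketch under the same assumption the answer makes (anything that is not valid UTF-8 is treated as ISO-8859-1); the method name `normalize_query` is hypothetical and not part of Rails or CGI.

```ruby
require 'cgi'

# Hypothetical helper: unescape a raw query value and return valid UTF-8.
# Assumption: input that is not valid UTF-8 was sent as ISO-8859-1.
def normalize_query(raw)
  text = CGI.unescape(raw)

  # Already valid UTF-8 (e.g. "%C3%A7"): use it unchanged.
  return text if text.valid_encoding?

  # Otherwise reinterpret the bytes as ISO-8859-1 and transcode to UTF-8.
  text.encode("UTF-8", "ISO-8859-1")
end

puts normalize_query("Fran%E7ois+Vergniolle+de+Chantal")
puts normalize_query("Fran%C3%A7ois+Vergniolle+de+Chantal")
```

Both calls should print the same name. Note that this is a guess, not detection: bytes from some other single-byte encoding would also be read as ISO-8859-1.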

score 0 · Answer 2 · answered Mar 21 '12 at 10:31

It looks like a URL-encoded string. For reference, here is a list of encoded characters: http://www.degraeve.com/reference/urlencoding.php

Unfortunately the CGI library has problems with UTF-8: while the unescape method works well with some characters, such as spaces, it does not work well with others.

require 'cgi'
a = "Fran%E7ois+Vergniolle+de+Chantal"
a = a.gsub('+', ' ').gsub('%E7', 'ç')
puts a
=> François Vergniolle de Chantal

a = "Fran%E7ois+Vergniolle+de+Chantal"
a = CGI::unescape(a) 
puts a
=> Franis Vergniolle de Chantal

Maybe you can implement your own method using gsub and the list of encoded characters?
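If one did hand-roll the decoding, a per-character gsub table would not be necessary: the percent escapes can be turned into raw bytes generically and then transcoded. The sketch below assumes the input is ISO-8859-1, as in the question; the method name `decode_latin1_query` is hypothetical.

```ruby
# Hypothetical helper: percent-decode a query string byte by byte,
# then interpret the resulting bytes as ISO-8859-1 and convert to UTF-8.
def decode_latin1_query(escaped)
  bytes = []

  # '+' means space in form-encoded queries; then walk the string,
  # matching either a %XX escape or any single character.
  escaped.tr('+', ' ').scan(/%(\h\h)|(.)/m) do |hex, char|
    if hex
      bytes << hex.hex                # %E7 -> 0xE7
    else
      bytes.concat(char.bytes.to_a)   # literal character, kept as-is
    end
  end

  # Assumption: the bytes are ISO-8859-1. Every byte is defined there,
  # so the conversion to UTF-8 cannot fail.
  bytes.pack('C*').force_encoding("ISO-8859-1").encode("UTF-8")
end

puts decode_latin1_query("Fran%E7ois+Vergniolle+de+Chantal")
```

Unlike the `gsub('%E7', 'ç')` approach, this handles any escaped byte, but it would mangle input that was actually sent as UTF-8.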

Aurélien Bottazini


@MladenJablanović If the string is UTF-8, you shouldn't need to force_encoding to Latin-1 and then encode to UTF-8, should you? Since %E7 is a small c with cedilla in both character sets? Further reading suggests that %C3%A7 may be the correct encoding for this character under UTF-8, rather than %E7. – dkam Mar 21 '12 at 12:18
@Mladen Jablanović Your code does work but I do not like forcing encoding several times. Furthermore CGI::unescape starts with a string.tr('+', ' ').force_encoding(Encoding::ASCII_8BIT) and I don't like that at all :S – Aurélien Bottazini Mar 21 '12 at 12:39
