
I have a Rails 3 application running on Ruby 1.9, and I'm having trouble getting encodings to work.

My task is to open a remote HTML page and parse some information from it. All my code and my database are in UTF-8; I'm using the # encoding: UTF-8 magic comment, the MySQL fix, and so on.

The page I open is in the ISO-8859-1 charset, and when my parser finds strange characters it complains that they are not valid UTF-8.

I tried calling .force_encoding("UTF-8") on every string I parsed, but the error persists. When I try to convert the whole page, I get this:

a = open("someurl")
b = a.read.encode("UTF-8")
Encoding::UndefinedConversionError: "\xE9" from ASCII-8BIT to UTF-8
    from (irb):7:in `encode'
    from (irb):7
    from /Users/tscolari/.rvm/gems/ruby-1.9.2-p0/gems/railties-3.0.0/lib/rails/commands/console.rb:44:in `start'
    from /Users/tscolari/.rvm/gems/ruby-1.9.2-p0/gems/railties-3.0.0/lib/rails/commands/console.rb:8:in `start'
    from /Users/tscolari/.rvm/gems/ruby-1.9.2-p0/gems/railties-3.0.0/lib/rails/commands.rb:23:in `<top (required)>'
    from script/rails:6:in `require'
    from script/rails:6:in `<main>'

How can I fix this? It seems things had already gone wrong when the ISO-8859-1 page was "converted" to ASCII-8BIT.

UPDATE

I tried opening the URL with 'r:iso-8859-1:utf-8', but apparently my problem is now with Hpricot, which I use for parsing.

> a = open(b, 'r:iso-8859-1:utf-8')
> a.read.encoding
 => #<Encoding:UTF-8>
> Hpricot(a).inner_html.encoding
 => #<Encoding:ASCII-8BIT>

And all the errors come back. This is probably an Hpricot issue, but if anyone knows a fix, please share it.

Tiago
  • Does it work any better if you use Nokogiri instead of Hpricot? – dkarp Jan 16 '11 at 04:05
  • Well, you can convert inner_html to UTF-8 using force_encoding, but if you try inner_text, force_encoding gives you an error. The workaround is to use either inner_html or inner_content instead of inner_text. – Tiago Jan 16 '11 at 04:25
  • I'll give Nokogiri a try here! Thanks! – Tiago Jan 16 '11 at 04:26
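
The force_encoding workaround from the comment above can be illustrated without Hpricot. This is only a minimal sketch, assuming the bytes are already valid UTF-8 and merely carry the wrong ASCII-8BIT label; the sample string and variable names are made up for the example.

```ruby
# Hypothetical stand-in for a parser result: UTF-8 bytes for "café",
# but tagged as ASCII-8BIT (binary).
raw = "caf\xC3\xA9".dup.force_encoding("ASCII-8BIT")

# force_encoding only changes the encoding label; no bytes are converted.
fixed = raw.dup.force_encoding("UTF-8")

same_bytes = (fixed.bytes.to_a == raw.bytes.to_a)  # bytes are untouched
usable     = fixed.valid_encoding?                 # label now matches the bytes
```

If the bytes were actually ISO-8859-1, the same call would leave a string whose valid_encoding? is false, which is why relabeling alone does not help in that case.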

2 Answers


Hpricot UTF-8 issue: "invalid byte sequence in UTF-8 (ArgumentError)"

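require 'hpricot'
require 'open-uri'

# Parse the page body exactly as open-uri returns it.
doc = open('http://www.amazon.co.jp/') { |f| Hpricot(f.read) }

puts doc.to_html

# Transcode the body to UTF-8 before handing it to Hpricot.
doc = open('http://www.amazon.co.jp/') { |f| Hpricot(f.read.encode("UTF-8")) }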
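
For the ISO-8859-1 page in the question, where the body arrives tagged as ASCII-8BIT, one possible variation is to label the bytes with their real charset first and only then transcode. A minimal sketch, assuming the page really is ISO-8859-1; the sample bytes and the helper name to_utf8 are hypothetical:

```ruby
# Hypothetical helper: relabel raw bytes with their source charset,
# then transcode the result to UTF-8.
def to_utf8(raw, source_charset)
  raw.dup.force_encoding(source_charset).encode("UTF-8")
end

# Stand-in for a.read: "café" in ISO-8859-1, tagged as binary.
body = "caf\xE9".dup.force_encoding("ASCII-8BIT")

# Encoding straight from ASCII-8BIT fails: the byte \xE9 has no defined
# character in a binary string, so there is nothing to map to UTF-8.
direct_error = nil
begin
  body.encode("UTF-8")
rescue Encoding::UndefinedConversionError => e
  direct_error = e.class
end

# Relabeling first gives encode a real source charset to convert from.
converted = to_utf8(body, "ISO-8859-1")
```

The resulting string could then be passed to the parser in place of the raw body.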
Hauleth
Dipak Panchal
a = open("someurl", "r:iso-8859-1:utf-8")

See this other SO question for more details...
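
The mode string has the form r:external:internal: the data is taken to be in the external encoding and is transcoded to the internal one as it is read. Here is a small local sketch of the same mode string, using a temporary file as a stand-in for the remote page (the file name prefix and sample bytes are made up for illustration):

```ruby
require "tempfile"

# Write "café" as ISO-8859-1 bytes to a temporary file that stands in
# for the remote page.
page = Tempfile.new("latin1_page")
page.binmode
page.write("caf\xE9".dup.force_encoding("ASCII-8BIT"))
page.close

# External encoding ISO-8859-1, internal encoding UTF-8:
# read returns a string that has already been transcoded.
text = File.open(page.path, "r:iso-8859-1:utf-8") { |f| f.read }

page.unlink
```

On current Ruby versions, open-uri is called as URI.open rather than through Kernel#open.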

Community
dkarp