10

I have a Sinatra application (http://analyzethis.espace-technologies.com) that does the following:

  1. Retrieve an HTML page (via net/http)
  2. Create a Nokogiri document from the response.body
  3. Extract some info and send it back in the response. The response should be UTF-8 encoded

I ran into a problem when reading sites that use the windows-1256 encoding, such as www.filfan.com or www.masrawy.com.

The result of the encoding conversion is incorrect, even though no errors are raised.

With net/http, response.body.encoding returns ASCII-8BIT, which cannot be converted to UTF-8 directly.

If I call Nokogiri::HTML(response.body) and use CSS selectors to get certain content from the page (the content of the title tag, for example), I get a string whose string.encoding is WINDOWS-1256. I then call string.encode("utf-8") and send that in the response, but the result is still incorrect.

Any suggestions or ideas about what's wrong in my approach?

Nakilon
humanzz

2 Answers

28

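This happens because Net::HTTP does not handle encoding correctly: the body comes back tagged as ASCII-8BIT regardless of the charset the server declares. See http://bugs.ruby-lang.org/issues/2567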

You can parse response['content-type'], which contains the charset, instead of parsing the whole response.body.

Then use force_encoding() to set the right encoding. For example, if the site is served in UTF-8:

response.body.force_encoding("UTF-8")
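
A minimal sketch of those two steps, with the follow-up transcode to UTF-8 included. The helper names `charset_from_content_type` and `body_to_utf8` and the `default:` fallback are made up for illustration, not part of Net::HTTP; the sample bytes simulate a windows-1256 body so the snippet runs without a network call.

```ruby
# Hypothetical helper: pull the charset out of a Content-Type header value,
# e.g. "text/html; charset=windows-1256". Returns nil when none is declared.
def charset_from_content_type(content_type)
  return nil if content_type.nil?
  match = content_type.match(/charset=["']?([\w-]+)/i)
  match && match[1]
end

# Hypothetical helper: tag the raw bytes with their real charset, then
# transcode to UTF-8. The UTF-8 default is an assumption, not a standard.
def body_to_utf8(raw_body, content_type, default: "UTF-8")
  charset = charset_from_content_type(content_type) || default
  # dup so the caller's string keeps its original encoding tag
  raw_body.dup.force_encoding(charset).encode("UTF-8")
end

if __FILE__ == $PROGRAM_NAME
  # Simulate what Net::HTTP hands back: windows-1256 bytes tagged as binary.
  raw = "\u0645\u0631\u062D\u0628\u0627".encode("Windows-1256").b
  puts raw.encoding
  puts body_to_utf8(raw, "text/html; charset=windows-1256")
end
```

With a real response this would be called as body_to_utf8(response.body, response['content-type']). force_encoding raises ArgumentError if the server declares a charset name Ruby does not know, so production code would probably want to rescue that.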

A.D.
  • Although this solution does work, this issue only happened to me for certain sites. Perhaps when the Content-Type includes 'application/json', then it does encode in UTF-8...? According to http://stackoverflow.com/questions/9254891/what-does-content-type-application-json-charset-utf-8-really-mean, application/json implies UTF-8. – B Seven May 28 '14 at 14:40
  • The next logical step would be to call .encode!('UTF-8') on resulting string and then do the actual processing – Dmitry Vyal Jun 07 '14 at 08:06
  • @DmitryVyal You've saved my day mate – NoDisplayName Dec 22 '14 at 10:51
3

I found that the following code works for me now:

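# Parsed Nokogiri document, transcoded to UTF-8 when the charset is known.
def document
  if @document.nil? && response
    @document = if document_encoding
                  # Tag the raw bytes with their real charset, then transcode.
                  utf8_body = response.body.force_encoding(document_encoding).encode('utf-8')
                  Nokogiri::HTML(utf8_body, nil, 'utf-8')
                else
                  Nokogiri::HTML(response.body)
                end
  end
  @document
end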

# Charset from the Content-Type header, falling back to the <meta> tag.
def document_encoding
  return @document_encoding if @document_encoding
  response.type_params.each_pair do |k, v|
    @document_encoding = v.upcase if k =~ /charset/i
  end
  unless @document_encoding
    #document.css("meta[http-equiv=Content-Type]").each do |n|
    #  attr = n.get_attribute("content")
    #  @document_encoding = attr.slice(/charset=[a-z0-9\-_]+/i).split("=")[1].upcase if attr
    #end
    # Non-greedy capture and a strict charset pattern, so trailing attributes are not swallowed.
    @document_encoding = response.body =~ /<meta[^>]*HTTP-EQUIV=["']Content-Type["'][^>]*content=["'](.*?)["']/i && $1 =~ /charset=([\w-]+)/i && $1.upcase
  end
  @document_encoding
end
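
The meta-tag fallback can also be sketched on its own. This is only an illustration: `charset_from_meta` is a made-up name, the 4096-byte scan window is an arbitrary assumption, and the regex is loosened to accept the shorter `<meta charset=...>` form as well, which the code above does not handle.

```ruby
# Hypothetical helper: sniff a charset from <meta> tags in raw HTML bytes.
# Returns an upcased charset name, or nil when no declaration is found.
def charset_from_meta(html)
  # Work on a binary copy so invalid byte sequences cannot break matching.
  # Only the first 4096 bytes are scanned (an arbitrary choice).
  head = html.b[0, 4096]
  # Matches both forms:
  #   <meta charset="windows-1256">
  #   <meta http-equiv="Content-Type" content="text/html; charset=windows-1256">
  match = head.match(/<meta[^>]*charset\s*=\s*["']?([\w-]+)/i)
  match && match[1].upcase
end

if __FILE__ == $PROGRAM_NAME
  page = %q(<head><meta http-equiv="Content-Type" content="text/html; charset=windows-1256"></head>)
  puts charset_from_meta(page)
end
```

Matching on the raw bytes rather than on a parsed document avoids parsing the page twice just to discover its encoding.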
humanzz