EDIT: I took another look at the page and how you're trying to process it, and I think this works better. I also changed how the page is processed, because the original wasn't as clear as I'd like for maintainability and readability.
require 'addressable/uri'
require 'nokogiri'
require 'open-uri'

def get_chapter(base_url, params = {})
  # Build the request URL from the base plus the query parameters.
  uri = Addressable::URI.parse(base_url)
  uri.query_values = params

  # Parse strictly as XML and nudge Nokogiri toward the right encoding.
  doc = Nokogiri::XML(open(uri.to_s))
  doc.encoding = 'UTF-8'

  # Grab the passage div, then strip the footnotes and headings from it.
  div = doc.at_css('.result-text-style-normal')
  div.css('.footnotes').remove
  div.css('h4').remove

  doc
end

page = get_chapter('http://www.biblegateway.com/passage/', :search => 'Mateo1-2', :version => 'NVI')
puts page.content
Rather than building the URL by hand like you were, I prefer passing it in as chunks, with the base URL and the parameters split apart. I build the URI using the Addressable gem, which is my go-to for munging URLs; Ruby's built-in URI is having some growing pains right now related to encoding of parameters.
The document at the far end of the URL you gave says it is XHTML, so it should meet the XHTML specs. You can parse XHTML using Nokogiri::HTML(), but I think you get better results using Nokogiri::XML(), which is more strict.
To give Nokogiri an additional nudge in the right direction for parsing the content, I add:
doc.encoding = 'UTF-8'
I prefer finding the desired div, assigning it to a temporary variable, and working from that point, rather than chaining everything to the parse step like you did. It's more idiomatic and readable this way because we're dealing with the document in chunks.
Running the code outputs what appears to be nice and clean content. There is some embedded Javascript, but that is unavoidable because Javascript is treated as text inside the <script> tags. That isn't an issue if you are presenting the HTML for a browser to render.