EDIT: I took another look at the page and how you're trying to process it, and I think this works better. I also changed how the page is processed, because the original wasn't as clear as I'd like for maintainability and readability.
require 'addressable/uri'
require 'nokogiri'
require 'open-uri'

def get_chapter(base_url, params = {})
  # Build the request URL from the base plus the query parameters.
  uri = Addressable::URI.parse(base_url)
  uri.query_values = params

  # Parse strictly as XML and nudge Nokogiri toward the right encoding.
  doc = Nokogiri::XML(open(uri.to_s))
  doc.encoding = 'UTF-8'

  # Grab the passage div, then strip the footnotes and headings from it.
  div = doc.at_css('.result-text-style-normal')
  div.css('.footnotes').remove
  div.css('h4').remove

  doc
end

page = get_chapter('http://www.biblegateway.com/passage/', :search => 'Mateo1-2', :version => 'NVI')
puts page.content
Rather than building the URL by hand like you were, I prefer passing it in as chunks, with the base URL and the parameters split apart. I build the URI using the Addressable gem, which is my go-to for munging URLs; Ruby's built-in URI is having some growing pains right now related to encoding of parameters.
The document at the far end of the URL you gave says it is XHTML, so it should meet the XHTML specs. You can parse XHTML using Nokogiri::HTML(), but I think you get better results using Nokogiri::XML(), which is more strict.
To give Nokogiri an additional nudge in the right direction for parsing the content, I add:
doc.encoding = 'UTF-8'
I prefer finding the desired div, assigning it to a temporary variable, and working from that point, rather than chaining everything to the parse step like you did. It's more idiomatic and readable this way because we're dealing with the document in chunks.
Running the code outputs what appears to be nice and clean content. There is some embedded Javascript, but that is unavoidable because Javascript is treated as text inside the <script> tags. That isn't an issue if you are presenting the HTML for a browser to render.