
I'm using Nokogiri and open-uri to grab the contents of the title tag on a webpage, but am having trouble with accented characters. What's the best way to deal with these? Here's what I'm doing:

require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(open(link))
title = doc.at_css("title")

At this point, the title looks like this:

Rag\303\271

Instead of:

Ragù

How can I have Nokogiri return the proper character (e.g. ù in this case)?

Here's an example URL:

http://www.epicurious.com/recipes/food/views/Tagliatelle-with-Duck-Ragu-242037

Moe
  • It would be of assistance to those who help if we could have the URL to the site so we can test against it. – Ryan Bigg Apr 03 '10 at 19:35
  • How do you inspect the title afterwards and which Ruby version you are using? `Rag\303\271` _is_ `Ragù` UTF-8-encoded. – Mladen Jablanović Apr 03 '10 at 19:51
  • Hi Mladen, I'm using Ruby 1.8.6. I'm inspecting the title from the Ruby interactive console. Ultimately, it ends up being stored in a MySQL database. Once in MySQL it looks like: ù – Moe Apr 03 '10 at 19:59

8 Answers


Summary: When feeding UTF-8 to Nokogiri through open-uri, use open(...).read and pass the resulting string to Nokogiri.

Analysis: If I fetch the page using curl, the headers properly show Content-Type: text/html; charset=UTF-8 and the file content includes valid UTF-8, e.g. "Genealogía de Jesucristo". But even with a magic comment on the Ruby file and setting the doc encoding, it's no good:

# encoding: UTF-8
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open('http://www.biblegateway.com/passage/?search=Mateo1-2&version=NVI'))
doc.encoding = 'utf-8'
h52 = doc.css('h5')[1]
puts h52.text, h52.text.encoding
#=> Genealogà a de Jesucristo
#=> UTF-8

We can see that this is not the fault of open-uri:

html = open('http://www.biblegateway.com/passage/?search=Mateo1-2&version=NVI')
gene = html.read[/Gene\S+/]
puts gene, gene.encoding
#=> Genealogía
#=> UTF-8
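
For what it's worth, open-uri's charset method (OpenURI::Meta#charset, read from the Content-Type header) also reports UTF-8:

html = open('http://www.biblegateway.com/passage/?search=Mateo1-2&version=NVI')
puts html.charset
#=> utf-8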

This is a Nokogiri issue when dealing with open-uri, it seems. This can be worked around by passing the HTML as a raw string to Nokogiri:

# encoding: UTF-8
require 'nokogiri'
require 'open-uri'

html = open('http://www.biblegateway.com/passage/?search=Mateo1-2&version=NVI')
doc = Nokogiri::HTML(html.read)
doc.encoding = 'utf-8'
h52 = doc.css('h5')[1].text
puts h52, h52.encoding, h52 == "Genealogía de Jesucristo"
#=> Genealogía de Jesucristo
#=> UTF-8
#=> true
Phrogz
  • Wow, I never would have figured out that the addition of `.read` would fix this. Thank you! –  Nov 04 '15 at 17:47

I was having the same problem, and the Iconv approach wasn't working. Nokogiri::HTML() is shorthand for Nokogiri::HTML.parse(thing, url, encoding, options).

So, you just need to do:

doc = Nokogiri::HTML(open(link).read, nil, 'utf-8')

and it'll handle the page encoding properly as UTF-8. You'll see Ragù instead of Rag\303\271.

user660745

When you say "looks like this," are you viewing this value in IRB? IRB escapes characters outside the ASCII range, showing C-style escapes of the byte sequences that represent them.

If you print them with puts, you'll get them back as you expect, presuming your shell console is using the same encoding as the string in question (apparently UTF-8 in this case, based on the two bytes returned for that character). If you are storing the values in a text file, printing to a handle should also result in UTF-8 sequences.
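
For example, in an IRB session on a UTF-8 terminal (a minimal sketch; 1.8's inspect shows octal escapes for non-ASCII bytes):

title = "Rag\303\271"
title          # IRB displays the inspected value: "Rag\303\271"
puts title     # writes the raw bytes; a UTF-8 terminal shows: Ragù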

If you need to translate between UTF-8 and other encodings, the specifics depend on whether you're in Ruby 1.9 or 1.8.6.

For 1.9, see http://blog.grayproductions.net/articles/ruby_19s_string; for 1.8, you probably need to look at Iconv.
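
A minimal sketch of both approaches (using the ISO-8859-1 byte \xF9 for ù as an illustrative value):

# Ruby 1.9+: String#encode
latin1 = "Rag\xF9".force_encoding('ISO-8859-1')
utf8   = latin1.encode('UTF-8')                      #=> "Ragù"

# Ruby 1.8: Iconv
require 'iconv'
utf8 = Iconv.conv('UTF-8', 'ISO-8859-1', "Rag\xF9")  # the UTF-8 bytes for "Ragù"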

Also, if you need to interact with COM components on Windows, you'll need to tell Ruby to use the correct encoding with something like the following:

require 'win32ole'

WIN32OLE.codepage = WIN32OLE::CP_UTF8

If you're interacting with MySQL, you'll need to set the collation on the table to one that supports the encoding you're working with. In general, it's best to set the collation to UTF-8, even if some of your content is coming back in other encodings; you'll just need to convert as necessary.
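
For example, a rough sketch using the mysql2 gem (the gem choice, connection details, and table name are assumptions for illustration):

require 'mysql2'

# Make the connection itself speak UTF-8...
client = Mysql2::Client.new(host: 'localhost', username: 'user',
                            database: 'recipes', encoding: 'utf8')

# ...and convert an existing table's character set/collation to match (hypothetical table name).
client.query("ALTER TABLE recipes CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci")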

Nokogiri has some features for dealing with different encodings (probably through Iconv), but I'm a little out of practice with that, so I'll leave explanation of that to someone else.

JasonTrue
  • Hi Jason, Thanks so much for all the help. Got it working perfectly. I set my MySQL DB encoding to UTF-8 as well as my terminal profile. – Moe Apr 03 '10 at 21:31
  • @Moe This might be 'handling' the issue, or it might be masking it. See my answer for how to cleanly ensure that Nokogiri is getting the right UTF-8 content. – Phrogz Jan 15 '11 at 21:02

Try setting the encoding option of Nokogiri, like so:

require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(open(link))
doc.encoding = 'utf-8'
title = doc.at_css("title")
Koen.

Changing Nokogiri::HTML(...) to Nokogiri::HTML5(...) fixed issues I was having with parsing certain special characters, specifically em-dashes.

(The accented characters in your link came through fine with both, so I don't know if this would help with your case.)

EXAMPLE:

url = 'https://www.youtube.com/watch?v=4r6gr7uytQA'

doc = Nokogiri::HTML(open(url))
doc.title
=> "Josh Waitzkin â\u0080\u0094 How to Cram 2 Months of Learning into 1 Day | The Tim Ferriss Show - YouTube"

doc = Nokogiri::HTML5(open(url))
doc.title
=> "Josh Waitzkin — How to Cram 2 Months of Learning into 1 Day | The Tim Ferriss Show - YouTube"
Yarin
  • That worked! I've been using Nokogiri for years and never knew about Nokogiri::HTML5. For those interested, you'll have to add the `nokogumbo` gem to your project. – Kyle Krzeski Apr 23 '20 at 02:12
  • Why `nokogumbo`? – Yarin Apr 23 '20 at 12:55
  • It's the only way to use `Nokogiri::HTML5` (https://github.com/rubys/nokogumbo); otherwise you'll get an error message that the Nokogiri HTML5 module was not found. `Module: Nokogiri::HTML5 is defined in: lib/nokogumbo/html5.rb` (https://www.rubydoc.info/gems/nokogumbo/Nokogiri/HTML5), which means you'll need the `nokogumbo` gem. – Kyle Krzeski Apr 23 '20 at 12:58
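
Putting the comments together, a minimal sketch (assuming the nokogumbo gem is installed, e.g. via gem install nokogumbo):

require 'open-uri'
require 'nokogumbo'   # provides Nokogiri::HTML5

doc = Nokogiri::HTML5(open('https://www.youtube.com/watch?v=4r6gr7uytQA').read)
puts doc.title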

You need to convert the response from the website being scraped (here epicurious.com) into UTF-8 encoding.

According to the HTML content of the page being scraped, its encoding is currently "ISO-8859-1". So you need to do something like this:

require 'iconv'
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(Iconv.conv('utf-8//IGNORE', 'ISO-8859-1', open(link).read))

Read more about it here: http://www.quarkruby.com/2009/9/22/rails-utf-8-and-html-screen-scraping

Nakul
  • From the sample provided, it's clear that his content is already in UTF-8. – JasonTrue Apr 08 '10 at 06:22
  • Nope, it isn't; else he would get ù only. The webpage is not UTF-8 encoded. – Nakul Apr 08 '10 at 13:50
  • \303\271 are c-escaped UTF-8 byte values, which is how they appear in IRB when you look at an evaluated string; it's octal for C3 B9, which is the UTF-8 sequence for ù. If it were iso-8859-1, he would have gotten the octal for F9, or \371. – JasonTrue Apr 09 '10 at 23:26
  • But then, why would it look like ù in MySQL? As I understand it, it's IRB that's not able to display it as UTF-8, right? – Nakul Apr 13 '10 at 10:36
  • That was a separate problem, which I explained in my answer. MySQL collation needs to be set to UTF-8 on the table you're storing data in. IRB can display UTF-8 text on appropriate terminals, but it won't display evaluated expressions as UTF-8; it shows them as ASCII plus octal-escaped sequences. (`puts` will behave differently. See `puts "\001"` vs `"\001"` in IRB for an example that isn't UTF-8 specific.) – JasonTrue Apr 21 '10 at 05:53
  • See my 'answer' in progress: the headers are utf-8, the data is utf-8, open-uri returns utf-8, but Nokogiri flubs it. – Phrogz Jan 15 '11 at 20:48

Tip: you could also use the Scrapifier gem to get metadata, such as the page title, from URIs in a very simple way. The data is all encoded in UTF-8.

Check it out: https://github.com/tiagopog/scrapifier

Hope it's useful for you.

Tiago G.

Just to add a cross-reference, this SO page gives some related information:

How to make Nokogiri transparently return un/encoded Html entities untouched?

the Tin Man