
I'm working on a small app to save bookmarks. I use Nokogiri and Pismo (separately) to crawl a webpage and get its title tag.

Nokogiri doesn't correctly save titles in Japanese, Chinese, Russian, or other languages with non-Latin characters. Pismo, on the other hand, does save the characters from these languages, but it's a little slow and doesn't extract title information as reliably as Nokogiri.

Could anybody recommend a better gem or a better way to save that data?

require 'open-uri'  # provides open() for URLs

doc = Nokogiri::HTML(open(bookmark_params[:link]))

@bookmark = current_user.bookmarks.build(bookmark_params)
@bookmark.title = doc.title.to_s

This is what I mean by "weird characters". If I use Nokogiri on the link below to scan for the page title:

youtube.com/watch?v=QXAwnMxlE2Q

this is what I get:

NTV interview foreigners in Japan æ¥ãã¬å¤äººè¡é ­ã¤ã³ã¿ãã¥ã¼ Eng...

But using the Pismo gem, this is what I get:

NTV interview foreigners in Japan 日テレ外人街頭インタビュー English Subtitles 英語字幕

which is the actual result I want, but the gem is a bit slower.

  • What do you mean that Nokogiri doesn't save "weird" characters? Add a minimal example of HTML that duplicates this. Also, can you come up with a better title for your question that indicates the problem? That helps people select questions to work on. – the Tin Man Jan 01 '15 at 16:33
  • Please update your question with that information instead of adding it as a comment. When we answer, it helps us greatly to be able to look in one place, the question, for everything we need to know. Be sure to format it so it is easily readable also. Thank you. – the Tin Man Jan 06 '15 at 20:13

2 Answers


In my experience, when you hit encoding problems with Nokogiri, RestClient, or other web-scraping gems, it helps to find out what encoding the document says it uses.

This info is usually located in a meta tag:

<meta http-equiv="Content-Type" content="text/html; charset=Windows-1251">

This will not always be true as the actual encoding may be different from what the tag suggests, but it's worth a try if you can find a meta-tag at all. Or, you could try a few different encodings.

  1. Then get the response body as a string:

    require 'nokogiri'
    require 'open-uri'

    html = open('http://example.com').read
    

    and try:

    doc = Nokogiri::HTML(html.force_encoding('Windows-1251').encode('UTF-8'))
    
  2. Or, setting Nokogiri's encoding explicitly could help:

    doc = Nokogiri::HTML(open('http://example.com'), nil, 'Windows-1251')
    
    
Evgenia Karunus

See Phrogz's answer here: Nokogiri, open-uri, and Unicode Characters, which I think correctly describes what is happening for you. In summary, for some reason there is an issue passing the IO object created by open-uri directly to Nokogiri. Instead, read the document in as a string and give that to Nokogiri, i.e.:

require 'nokogiri'
require 'open-uri'

open("https://www.youtube.com/watch?v=QXAwnMxlE2Q") {|f|
  p f.content_type     # "text/html"
  p f.charset          # "UTF-8"
  p f.content_encoding # []
}

doc = Nokogiri::HTML(open("https://www.youtube.com/watch?v=QXAwnMxlE2Q"))
puts doc.title.to_s # =>  NTV interview foreigners in Japan æ¥ãã¬å¤äººè¡é ­ã¤ã³ã¿ãã¥ã¼ English Subtitles è±èªå­å¹ - YouTube


doc = Nokogiri::HTML(open("https://www.youtube.com/watch?v=QXAwnMxlE2Q").read)
puts doc.title.to_s # => NTV interview foreigners in Japan 日テレ外人街頭インタビュー English Subtitles 英語字幕 - YouTube

If you know the content is always going to be UTF-8, you could of course do this:

doc = Nokogiri::HTML(open("https://www.youtube.com/watch?v=QXAwnMxlE2Q"), nil, "UTF-8")
rainkinz