Nokogiri select hyperlinks from XML::NodeSet

Question

I have written the following simple script to parse reddit/r/documentaries:

require 'open-uri'
require 'nokogiri'

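# Kernel#open with a URL was deprecated in Ruby 2.7 and removed in 3.0; URI.open is the open-uri entry point.
top_docs = Nokogiri::XML(URI.open("http://www.reddit.com/r/Documentaries/top.rss"))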
top_docs.xpath('//item').each do |documentary|
    documentary_description = documentary.xpath('description')
end

I am trying to gather an array of all the hyperlinks within documentary_description. What selector / method should I use to accomplish this?

Thanks

score 3 · Accepted Answer · answered Nov 24 '12 at 14:02

You can use the extract method provided by URI:

top_docs.xpath('//item').each do |documentary|
  documentary_description = documentary.xpath('description')
  links = URI.extract(documentary_description.text)
  ...
end
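
To see what `URI.extract` returns, here is a minimal standalone sketch. The `description_html` string is a made-up stand-in for one item's description, not real feed content. Newer Ruby versions mark `URI.extract` as obsolete (it warns in verbose mode), so treat this as an illustration of the call rather than a recommendation.

```ruby
require 'uri'

# Hypothetical description body; the real feed's HTML will differ.
description_html = '<a href="http://example.com/film">watch</a> ' \
                   '<img src="http://example.com/thumb.jpg">'

# The second argument restricts matches to the listed schemes.
links = URI.extract(description_html, %w[http https])

puts links
```

Because the match is purely textual, every URL-like string is returned regardless of which tag or attribute it came from.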

answered Nov 24 '12 at 14:02

Chris Salzberg

To give credit where credit is due: I discovered the `extract` method from this SO answer: http://stackoverflow.com/questions/3665072/extract-url-from-text#9716632 Might use it myself in the future, very handy! – Chris Salzberg Nov 24 '12 at 14:05
Why not just `search('a')`? Are the links in plaintext? – akuhn Nov 27 '12 at 04:54

score 2 · Answer 2 · answered Nov 24 '12 at 14:31

One-liner (using the handy URI.extract noted by @shioyama):

links = URI.extract(top_docs.xpath('//item/description').map(&:text).join(" "))

answered Nov 24 '12 at 14:31

Mark Thomas

score 1 · Answer 3 · answered Nov 24 '12 at 22:48

Be careful with URI.extract: in this case it also picks up an img src, which is probably unwanted. Parsing the description with Nokogiri is more reliable:

links = Nokogiri::HTML(documentary_description.text).search('a').map{|x| x[:href]}

answered Nov 24 '12 at 22:48

pguardiario

Of course, one could also add `*[not(self::img)]` to any XPath and exclude them from `extract` that way. – Mark Thomas Nov 24 '12 at 23:49
