0

I have written the following simple script to parse the top posts of reddit.com/r/Documentaries:

require 'open-uri'
require 'nokogiri'

# Kernel#open no longer accepts URLs as of Ruby 3.0; open-uri provides URI.open instead
top_docs = Nokogiri::XML(URI.open("http://www.reddit.com/r/Documentaries/top.rss"))
top_docs.xpath('//item').each do |documentary|
  documentary_description = documentary.xpath('description')
end

I am trying to gather an array of all the hyperlinks within documentary_description. What selector / method should I use to accomplish this?

Thanks

Karl Entwistle

3 Answers

3

You can use the extract method provided by URI:

top_docs.xpath('//item').each do |documentary|
  documentary_description = documentary.xpath('description')
  links = URI.extract(documentary_description.text)
  ...
end
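
As a rough illustration of what `URI.extract` does, here is a minimal sketch on a made-up string rather than the real feed. The sample text and the variable names are assumptions for demonstration only. The optional second argument restricts matches to the listed schemes. Note that the standard library marks `URI.extract` as obsolete and prints a warning in verbose mode, although it is still available.

```ruby
require 'uri'

# Hypothetical sample text standing in for documentary_description.text
description_text = "Full film at http://example.com/film and a mirror at " \
                   "https://example.org/mirror (contact mailto:someone@example.com)"

# Passing a list of schemes keeps only http/https URLs and skips
# other schemes such as mailto:
links = URI.extract(description_text, %w[http https])

puts links
```

Inside the `each` loop above, the same call could be applied to `documentary_description.text` and the results appended to one array.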
Chris Salzberg
  • To give credit where credit is due: I discovered the `extract` method from this SO answer: http://stackoverflow.com/questions/3665072/extract-url-from-text#9716632 Might use it myself in the future, very handy! – Chris Salzberg Nov 24 '12 at 14:05
  • Why not just `search('a')`, are the links in plaintext? – akuhn Nov 27 '12 at 04:54
2

One-liner (using the handy `URI.extract` noted by @shioyama):

links = URI.extract(top_docs.xpath('//item/description').to_a.join(" "))
Mark Thomas
1

Be careful with `URI.extract`: in this case it also picks up an img src, which is probably unwanted. Nokogiri is more reliable:

links = Nokogiri::HTML(documentary_description.text).search('a').map{|x| x[:href]}
pguardiario