
I am trying to extract all the URLs from the raw output of some Ruby code:

require 'open-uri'

reqt = open("http://www.google.com").read
reqt.each_line { |line|
  if line =~ /http/ then
    puts URI.extract(line)
  end
}

What am I doing wrong? I am getting extra lines along with URLs.
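For reference, part of what causes stray matches is that URI.extract with no arguments grabs anything shaped like scheme:opaque, not just web URLs; passing a scheme list narrows it. A minimal sketch on a made-up sample string (example.com is a placeholder):

```ruby
require 'uri'

# Unrestricted, URI.extract can also pick up scheme-like tokens
# such as "note:remember", not just real web URLs.
text = 'note:remember http://example.com/page'
puts URI.extract(text).inspect

# Restricting to http/https keeps only the web URLs.
puts URI.extract(text, %w[http https]).inspect
# => ["http://example.com/page"]
```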

the Tin Man
0xr3d0c
  • See [this comment](http://stackoverflow.com/questions/3665072/extract-url-from-text#comment29789408_9716632) on the duplicated question's [most popular answer](http://stackoverflow.com/a/9716632/182590). – Mark Thomas Aug 02 '14 at 13:55

2 Answers


You can do this instead:

require 'open-uri'
reqt = open("http://www.google.com").read
urls = reqt.scan(/[[:lower:]]+:\/\/[^\s"]+/)
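On a small self-contained sample (the HTML snippet here is my own illustration, not Google's output), the scan behaves like this:

```ruby
html = '<a href="http://example.com/a">x</a> <img src="https://example.com/b.png">'

# The pattern matches a lowercase scheme, "://", then everything up to
# whitespace or a closing double quote.
urls = html.scan(/[[:lower:]]+:\/\/[^\s"]+/)
puts urls.inspect
# => ["http://example.com/a", "https://example.com/b.png"]
```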
konsolebox

Remember that a URL doesn't have to start with "http" - it could be a relative URL, a path resolved against the current page. IMO it is best to use Nokogiri to parse the HTML:

require 'open-uri'
require 'nokogiri'
reqt = open("http://www.google.com")
doc = Nokogiri::HTML(reqt)
doc.xpath('//a[@href]').each do |a|
  puts a.attr('href')
end

But if you really want to find only the absolute URLs, add a simple condition:

 puts a.attr('href') if a.attr('href') =~ /^http/i
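If instead you want to keep the relative hrefs and turn them into absolute URLs, the standard library can resolve them against the base page. A sketch using only stdlib URI (the relative path is a hypothetical example of what Nokogiri might return):

```ruby
require 'uri'

base = URI('http://www.google.com/')

# Resolve a relative href, as found in an <a> tag, against the base URL.
abs = URI.join(base, '/search?q=ruby').to_s
puts abs
# => "http://www.google.com/search?q=ruby"
```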
the Tin Man
Grych