I need to write a Ruby program that, given an Internet address as a command-line argument (ARGV), returns a list of the images found on the page (images in HTML use the `<img>` tag) and a list of all the links that point to other pages (links in HTML use the `<a>` tag).

So far I split the page's source code on the characters `<` and `>`.

The code at the moment:

require 'net/http'
pagina= Net::HTTP.get(ARGV[0], '/index.html')
xx = pagina.split(/[<,>]/)
puts xx
puts xx.scan(/a href=/)
  • HE COMES – Tom Lord Nov 06 '17 at 16:49
  • `< a alt="Image" href="/example.png" />` will be utterly invisible to your regular expression approach. **Use a proper parsing library** because HTML is deceptively hard. – tadman Nov 06 '17 at 17:01

1 Answer

Do not use regular expressions to parse HTML.

Use an HTML parser. For example, Nokogiri:

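require 'net/http'
require 'nokogiri'

# ARGV[0] is a bare host name here, as in the question
pagina = Net::HTTP.get(ARGV[0], '/index.html')
# Print the href attribute of every <a> element
puts Nokogiri::HTML(pagina).css('a').map { |link| link['href'] }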
Tom Lord