I'm trying to write a Ruby script that takes the Flickr BBCode for an image and extracts only the actual image link, ignoring everything else.

The BBCode from Flickr looks like this:

<a href="http://www.flickr.com/photos/user/9049969465/" title="Wiggle Wiggle by Anonymous, on Flickr"><img src="https://farm3.staticflickr.com/2864/92917419471_248187_c.jpg" width="800" height="526" alt="Wiggle Wiggle"></a>

and I want my output to be just the image link: https://farm3.staticflickr.com/2864/92917419471_248187_c.jpg

So far, my code is this:

#!/usr/bin/ruby

require 'rubygems'

str1 = ""

puts "What text would you like me to use? "
text = gets

text.scan(/"([^"]*)"/) { str1 = $1}

puts str1

I need to know how to scan through the input and find only the part that starts with https and ends at the closing quote. Any help is appreciated.
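
For reference, the scan described here could be sketched as below, using only the Ruby standard library. It assumes the image link is the only double-quoted value in the input that starts with https://, which holds for the sample above but is not guaranteed for other markup. The variable name urls is illustrative.

```ruby
# Sample input: the Flickr snippet shown earlier in the question.
text = '<a href="http://www.flickr.com/photos/user/9049969465/" title="Wiggle Wiggle by Anonymous, on Flickr"><img src="https://farm3.staticflickr.com/2864/92917419471_248187_c.jpg" width="800" height="526" alt="Wiggle Wiggle"></a>'

# Match a double-quoted value that starts with https:// and capture
# everything up to (but not including) the closing quote.
# scan returns one array per match, so flatten to a plain list of strings.
urls = text.scan(%r{"(https://[^"]+)"}).flatten

puts urls.first
```

The href in the sample starts with http:// rather than https://, so it is not matched and only the image link is captured.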

chanman82

3 Answers

Don't try to parse HTML with a regex.

Instead, use an HTML parser, such as Nokogiri (http://nokogiri.org/):

require 'nokogiri'
doc = Nokogiri::HTML.parse '<a href="http://www.flickr.com/photos/user/9049969465/" title="Wiggle Wiggle by Anonymous, on Flickr"><img src="https://farm3.staticflickr.com/2864/92917419471_248187_c.jpg" width="800" height="526" alt="Wiggle Wiggle"></a>'

doc.css('a').each do |link|
  puts link.attr(:href)
end
Alex Wayne

You should really use a proper HTML parser if you're trying to parse HTML.

For example, this is trivial in Nokogiri:

require 'nokogiri'

bbcode = %Q[<a href="http://www.flickr.com/photos/user/9049969465/" title="Wiggle Wiggle by Anonymous, on Flickr"><img src="https://farm3.staticflickr.com/2864/92917419471_248187_c.jpg" width="800" height="526" alt="Wiggle Wiggle"></a>]

Nokogiri::HTML(bbcode).css('a')[0]['href']
# => "http://www.flickr.com/photos/user/9049969465/"

You'll obviously have to add some error checking in there, but that's the basics.

tadman

require 'nokogiri'

doc = Nokogiri::HTML(<<-eol)
<a href="http://www.flickr.com/photos/user/9049969465/" title="Wiggle Wiggle by Anonymous, on Flickr"><img src="https://farm3.staticflickr.com/2864/92917419471_248187_c.jpg" width="800" height="526" alt="Wiggle Wiggle"></a>
eol
doc.at_css("a")['href']
# => "http://www.flickr.com/photos/user/9049969465/"
doc.at("a")['href']
# => "http://www.flickr.com/photos/user/9049969465/"
Arup Rakshit