How to find certain text between quotes

Question

I'm trying to write a Ruby script that takes the Flickr BBCode for an image and extracts only the actual image link, ignoring everything else.

The BBCode from Flickr looks like this:

<a href="http://www.flickr.com/photos/user/9049969465/" title="Wiggle Wiggle by Anonymous, on Flickr"><img src="https://farm3.staticflickr.com/2864/92917419471_248187_c.jpg" width="800" height="526" alt="Wiggle Wiggle"></a>

I want my output to be just the image link: https://farm3.staticflickr.com/2864/92917419471_248187_c.jpg

So far, my code is this:

#!/usr/bin/ruby

require 'rubygems'

str1 = ""

puts "What text would you like me to use? "
text = gets

text.scan(/"([^"]*)"/) { str1 = $1}

puts str1

I need to know how I can scan through the input and find only the part that starts at https and ends with the closing quote. Any help is appreciated.

score 2 · Answer 1 · edited May 23 '17 at 12:13

Don't try to parse HTML with a regex.

Instead, use an HTML parser such as Nokogiri (http://nokogiri.org/):

require 'nokogiri'
doc = Nokogiri::HTML.parse '<a href="http://www.flickr.com/photos/user/9049969465/" title="Wiggle Wiggle by Anonymous, on Flickr"><img src="https://farm3.staticflickr.com/2864/92917419471_248187_c.jpg" width="800" height="526" alt="Wiggle Wiggle"></a>'

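# The question asks for the image URL, so select the <img> and read its src
# (selecting 'a' and reading href would give the Flickr page link instead).
doc.css('img').each do |img|
  puts img['src']
end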

answered Jun 19 '13 at 19:43

Alex Wayne


+1, yes, lest ["Russian hackers pwn your webapp"](http://stackoverflow.com/a/1732454/128421). – the Tin Man Jun 19 '13 at 19:56
Ok thanks. I didn't even know about that. Back to fix my code. Thanks for the help! – chanman82 Jun 19 '13 at 20:44

score 1 · Answer 2 · answered Jun 19 '13 at 19:44

You should really use a proper HTML parser if you're trying to parse HTML.

For example, this is trivial in Nokogiri:

require 'nokogiri'

bbcode = %Q[<a href="http://www.flickr.com/photos/user/9049969465/" title="Wiggle Wiggle by Anonymous, on Flickr"><img src="https://farm3.staticflickr.com/2864/92917419471_248187_c.jpg" width="800" height="526" alt="Wiggle Wiggle"></a>]

Nokogiri::HTML(bbcode).css('a')[0]['href']
# => "http://www.flickr.com/photos/user/9049969465/"

You'll obviously have to add some error checking in there, but that's the basics.

tadman


And, of course, `css('a')[0]` can be simplified to `at_css('a')`. – the Tin Man Jun 19 '13 at 19:54
@theTinMan yes, I did the same. :) – Arup Rakshit Jun 19 '13 at 19:55

Arup Rakshit · Answer 3 · 2013-06-19T19:56:48.960

require 'nokogiri'

doc = Nokogiri::HTML(<<-eol)
<a href="http://www.flickr.com/photos/user/9049969465/" title="Wiggle Wiggle by Anonymous, on Flickr"><img src="https://farm3.staticflickr.com/2864/92917419471_248187_c.jpg" width="800" height="526" alt="Wiggle Wiggle"></a>
eol
doc.at_css("a")['href']
# => "http://www.flickr.com/photos/user/9049969465/"
doc.at("a")['href']
# => "http://www.flickr.com/photos/user/9049969465/"
