Parse image url nokogiri

Question

I need to parse out the image URL from HTML much like the following:

<p><a href="http://blog.website.com/wp-content/uploads/2012/02/image_name.jpg" ><img class="aligncenter size-full wp-image-12313" alt="Example image Name" src="http://blog.website.com/wp-content/uploads/2012/02/image_name.jpg" width="630" height="119" /></a></p>

So far I am using Nokogiri to parse out <h2> tags with:

require 'rubygems'
require 'nokogiri'
require 'open-uri'

page = Nokogiri::HTML(URI.open("http://blog.website.com/"))
headers = page.css('h2')

puts headers.text

I have two questions:

1. How can I parse out the image URL?
2. Ideally I'd print to the console in this format:

1.
Header 1
image_url 1
image_url 2 (if any)
2.
Header 2
2image_url 1
2image_url 2 (if any)

So far I haven't been able to print my headers in this format. How can I do so?

<h2><a href="http://blog.website.com/2013/02/15/images/" rel="bookmark" title="Permanent Link to Blog Post">Blog Post</a></h2>
          <p class="post_author"><em>by</em> author</p>
          <div class="format_text">
    <p style="text-align: left;">Blog Content </p>
<p style="text-align: left;"> Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. </p>
<p style="text-align: center;"><a href="http://blog.website.com/wp-content/uploads/2012/02/image21.jpg" ><img class="alignnone size-full wp-image-23382" alt="image2" src="http://blog.website.com/wp-content/uploads/2012/02/image21.jpg" width="630" height="210" /></a></p>
<p style="text-align: left;">Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. </p>
<p style="text-align: center;"><b id="internal-source-marker_0.054238131968304515">Items: <a href="http://www.website.com/threads?src=login#/show/thread/A_abvaf812e3"  target="_blank">Items for Spring</a></b></p>
<p style="text-align: center;">Lorem Ipsum.</p>
<p style="text-align: center;"><b id="internal-source-marker_0.054238131968304515">More Items: <a href="http://www.website.com/threads#/show/thread/A_abv2a6822e2"  target="_blank">Lorem Ipsum</a></b></p>
<p style="text-align: center;">Lorem Ipsum.</p>
<p style="text-align: center;"><b id="internal-source-marker_0.054238131968304515">Still more items: <a href="http://www.website.com/threads#/show/thread/A_abv7af882e3"  target="_blank">Items:</a></b></p>
<p style="text-align: center;">Lorem Ipsum.</p>
<p style="text-align: center;"><b id="internal-source-marker_0.054238131968304515">Lorem ipsum: <a href="http://www.website.com/threads?src=login#/show/thread/A_abvea6832e8"  target="_blank">Items</a></b></p>
<p style="text-align: center;">Lorem Ipusm</p>
<p style="text-align: center;"><b id="internal-source-marker_0.054238131968304515">
        </div>  
          <p class="to_comments"><span class="date">February 15, 2013</span> &nbsp; <span class="num_comments"><a href="http://blog.website.com/2013/02/15/Blog-post/#respond" title="Comment on Blog Post">No Comments</a></span></p>

Possible duplicate of [Image scraping in Ruby](http://stackoverflow.com/questions/8956249/image-scraping-in-ruby) – Mark Thomas, Feb 20 '13 at 00:14
Sample HTML would help with the part of your question where you want to associate images with their header. – Mark Thomas, Feb 20 '13 at 00:25
I added some sample HTML (lorem ipsum text added and the website name hidden). I'm looking to parse the image in the third <p> and associate it with the header title. – Steven Harlow, Feb 20 '13 at 00:40

pguardiario · Accepted Answer · 2013-02-20T08:21:56.277

6

I think it makes more sense to group by h2 first:

doc.search('h2').each_with_index do |h2, i|
  puts "#{i+1}."
  puts h2.text
  h2.search('+ p + div > p[3] img').each do |img|
    puts img['src']
  end
end

edited Feb 20 '13 at 08:21

answered Feb 20 '13 at 07:41

This won't get all images; only the ones within the exact structure shown in his one example. – Mark Thomas Feb 20 '13 at 10:20
Right. That's what he asked for. – pguardiario Feb 20 '13 at 11:20
Well, he doesn't have an example of where the "image url 2 (if any)" can occur. – Mark Thomas Feb 20 '13 at 12:30
He specifically says the 3rd p. But it doesn't matter, it's easy to adjust to fit the situation. – pguardiario Feb 20 '13 at 13:00
Can this be adjusted to accommodate *any* images up to the next `h2`? I tried to find a way, and the only thing I came up with is the inside out solution I posted. – Mark Thomas Feb 20 '13 at 18:51
You're making this more complicated than it needs to be. – pguardiario Feb 21 '13 at 00:06
I don't think my answer's really that complicated, just more flexible. – Mark Thomas Feb 21 '13 at 00:44
If the goal is flexibility, then that's great. Usually, for things like this though, the goal is accuracy. – pguardiario Feb 21 '13 at 02:29
This is very close to what I ended up using (even including the each_with_index which I found later). Thanks! – Steven Harlow Feb 22 '13 at 01:29

Mark Thomas · Answer 2 · 2013-02-20T10:28:25.200

5

To get images, simply look for the img tags with a src attribute.

If you want the h2 associated with each image, you can do this:

doc.xpath('//img').each do |img|
  puts "Header: #{img.xpath('preceding::h2[1]').text}"
  puts "  Image: #{img['src']}"
end

Note that this uses XPath rather than CSS selectors, because the preceding:: axis has no CSS equivalent.

EDIT

To group by header, you can put them in a hash:

headers = Hash.new{|h,k| h[k] = []}
doc.xpath('//img').each do |img|
  header = img.xpath('preceding::h2[1]').text
  image = img['src']
  headers[header] << image
end

To get the output you've prescribed:

headers.each_with_index do |(h, urls), i|
  puts "#{i + 1}."
  puts h
  puts urls
end

edited Feb 20 '13 at 10:28

answered Feb 20 '13 at 01:10

Cool, is there an opposite method to "preceding"? Such as following? – Steven Harlow Feb 20 '13 at 01:31
This is very helpful, but I'm actually interested in the opposite. The headers and then the following images. Is there a way to do that using a technique close to this one you provide? – Steven Harlow Feb 20 '13 at 02:01
I tried: `doc.xpath('//h2/a[@rel = "bookmark"]').each do |header| puts "Header: #{header.text}" puts " Image: #{header.xpath('following::img[1]')['src']}" end` but I get a "Can't convert String into Integer" (TypeError) – Steven Harlow Feb 20 '13 at 02:07
No, you can't do the opposite because you'd get all the images whether or not there is an `h2` in between. You can still do it the way I show and group the images under each header. – Mark Thomas Feb 20 '13 at 03:04
When I do the code two comments above with puts " Image 1: #{header.xpath('following::img[1]').to_s}" I get – Steven Harlow Feb 20 '13 at 03:37
I just need to pull out that src now – Steven Harlow Feb 20 '13 at 03:38
I came back to accept your answer, but then saw pguardiario's that was closer to what I ended up doing. Thank you so much for the help! It allowed me to find what I wanted myself. – Steven Harlow Feb 22 '13 at 01:28

score 0 · Answer 3 · answered Feb 22 '13 at 01:26

Code that I ended up using. Feel free to critique (I'll probably learn from it):

require 'rubygems'
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(URI.open("http://blog.website.com/"))

doc.xpath('//h2/a[@rel = "bookmark"]').each_with_index do |header, i|
  puts i+1
  puts " Title: #{header.text}"
  puts "  Image 1: #{header.xpath('following::img[1]')[0]["src"]}"
  puts "  Image 2: #{header.xpath('following::img[2]')[0]["src"]}"
end

Steven Harlow

No, `following::img` will pick up an image that's past the next h2 and the [0]["src"] will cause errors if it doesn't exist. Also, use css when possible. – pguardiario Feb 22 '13 at 02:00
This code works on the webpage I'm using, whereas the code you provide leaves out a few of the images (though with some adjustments I'm sure it'd work). I'm sure it's because you don't have the full information. – Steven Harlow Feb 22 '13 at 04:05
The reason following::img[1] doesn't skip a h2 is because there are facebook like images as img[0], which I don't care about. The formatting is luckily consistent across the page. – Steven Harlow Feb 22 '13 at 04:06

score 0 · Answer 4 · answered Oct 23 '13 at 00:02

I did something similar once (I wanted the exact same output, actually). This solution is pretty easy to follow:

Depending on how the DOM is structured, you could do something like:

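body = page.css('div.format_text')
headers = page.css('div#content_inner h2 a')

# Headers and post bodies are matched by position, so both lists must line up.
body.each_with_index do |post, index|
   header = headers[index]
   # header is a Nokogiri node, so use its text rather than concatenating it.
   puts "#{index + 1}. #{header.text}"
   # Print only image URLs that start with "http".
   post.css('p a img, div > img').each { |img| puts img['src'] if img['src'].to_s.start_with?('http') }
end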

So basically, you walk through each post body and print its header followed by any images it contains. The page I was parsing had the headers outside of the divs that held the images, which is why I used two different variables to find them (body / headers). I also used two selectors when looking for images, because that is how this particular DOM was structured.

This should give you a nice clean output like you wanted.

Hope this helps!
