
I need to parse out the image URL from HTML much like the following:

<p><a href="http://blog.website.com/wp-content/uploads/2012/02/image_name.jpg" ><img class="aligncenter size-full wp-image-12313" alt="Example image Name" src="http://blog.website.com/wp-content/uploads/2012/02/image_name.jpg" width="630" height="119" /></a></p>

So far I am using Nokogiri to parse out <h2> tags with:

require 'rubygems'
require 'nokogiri'
require 'open-uri'

page = Nokogiri::HTML(URI.open("http://blog.website.com/"))
headers = page.css('h2')

puts headers.text

I have two questions:

  1. How can I parse out the image url?
  2. Ideally I'd print to the console in this format:
 1. 
Header 1
image_url 1
image_url 2 (if any)
 2. 
Header 2
2image_url 1
2image_url 2 (if any)

And so far I haven't been able to print my headers in this nice format. How can I do so?

<h2><a href="http://blog.website.com/2013/02/15/images/" rel="bookmark" title="Permanent Link to Blog Post">Blog Post</a></h2>
          <p class="post_author"><em>by</em> author</p>
          <div class="format_text">
    <p style="text-align: left;">Blog Content </p>
<p style="text-align: left;"> Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. </p>
<p style="text-align: center;"><a href="http://blog.website.com/wp-content/uploads/2012/02/image21.jpg" ><img class="alignnone size-full wp-image-23382" alt="image2" src="http://blog.website.com/wp-content/uploads/2012/02/image21.jpg" width="630" height="210" /></a></p>
<p style="text-align: left;">Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. </p>
<p style="text-align: center;"><b id="internal-source-marker_0.054238131968304515">Items: <a href="http://www.website.com/threads?src=login#/show/thread/A_abvaf812e3"  target="_blank">Items for Spring</a></b></p>
<p style="text-align: center;">Lorem Ipsum.</p>
<p style="text-align: center;"><b id="internal-source-marker_0.054238131968304515">More Items: <a href="http://www.website.com/threads#/show/thread/A_abv2a6822e2"  target="_blank">Lorem Ipsum</a></b></p>
<p style="text-align: center;">Lorem Ipsum.</p>
<p style="text-align: center;"><b id="internal-source-marker_0.054238131968304515">Still more items: <a href="http://www.website.com/threads#/show/thread/A_abv7af882e3"  target="_blank">Items:</a></b></p>
<p style="text-align: center;">Lorem Ipsum.</p>
<p style="text-align: center;"><b id="internal-source-marker_0.054238131968304515">Lorem ipsum: <a href="http://www.website.com/threads?src=login#/show/thread/A_abvea6832e8"  target="_blank">Items</a></b></p>
<p style="text-align: center;">Lorem Ipusm</p>
<p style="text-align: center;"><b id="internal-source-marker_0.054238131968304515">
        </div>  
          <p class="to_comments"><span class="date">February 15, 2013</span> &nbsp; <span class="num_comments"><a href="http://blog.website.com/2013/02/15/Blog-post/#respond" title="Comment on Blog Post">No Comments</a></span></p>
the Tin Man
Steven Harlow

4 Answers

I think it makes more sense to group by h2 first:

doc.search('h2').each_with_index do |h2, i|
  puts "#{i+1}."
  puts h2.text
  h2.search('+ p + div > p[3] img').each do |img|
    puts img['src']
  end
end
pguardiario

To get images, simply look for the img tags with a src attribute.

If you want the h2 associated with each image, you can do this:

doc.xpath('//img').each do |img|
  puts "Header: #{img.xpath('preceding::h2[1]').text}"
  puts "  Image: #{img['src']}"
end

Note that I switched from CSS to XPath here because CSS selectors have no equivalent of the preceding:: axis.

EDIT

To group by header, you can put them in a hash:

headers = Hash.new{|h,k| h[k] = []}
doc.xpath('//img').each do |img|
  header = img.xpath('preceding::h2[1]').text
  image = img['src']
  headers[header] << image
end

To get the output you've prescribed:

headers.each do |h,urls|
  puts "#{h} #{urls.join(' ')}"
end
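
If you want the numbered, one-item-per-line layout from the question rather than one line per header, a minimal sketch could look like the following. It assumes a hash shaped like `headers` above; `format_grouped` is a hypothetical helper name and the sample data is made up for illustration:

```ruby
# Hypothetical helper: turns { header => [urls] } into the numbered
# layout from the question, one string per output line.
def format_grouped(headers)
  headers.each_with_index.flat_map do |(header, urls), i|
    ["#{i + 1}.", header, *urls]
  end
end

# Made-up sample data standing in for the hash built from the page.
sample = {
  'Header 1' => ['http://blog.website.com/a.jpg', 'http://blog.website.com/b.jpg'],
  'Header 2' => ['http://blog.website.com/c.jpg']
}

puts format_grouped(sample)
```

Ruby hashes keep insertion order, so the headers come out in the order their first image was seen on the page.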
Mark Thomas
  • Cool, is there an opposite method to "preceding"? Such as following? – Steven Harlow Feb 20 '13 at 01:31
  • This is very helpful, but I'm actually interested in the opposite. The headers and then the following images. Is there a way to do that using a technique close to this one you provide? – Steven Harlow Feb 20 '13 at 02:01
  • I tried: doc.xpath('//h2/a[@rel = "bookmark"]').each do |header| puts "Header: #{header.text}" puts " Image: #{header.xpath('following::img[1]')['src']}" end but I get a "Can't convert String into Integer (TypeError) – Steven Harlow Feb 20 '13 at 02:07
  • No, you can't do the opposite because you'd get all the images whether or not there is an `h2` in between. You can still do it the way I show and group the images under each header. – Mark Thomas Feb 20 '13 at 03:04
  • When I do the code two comments above with puts " Image 1: #{header.xpath('following::img[1]').to_s}" I get image2 – Steven Harlow Feb 20 '13 at 03:37
  • I just need to pull out that src now – Steven Harlow Feb 20 '13 at 03:38
  • I came back to accept your answer, but then saw pguardiario's that was closer to what I ended up doing. Thank you so much for the help! It allowed me to find what I wanted myself. – Steven Harlow Feb 22 '13 at 01:28

Code that I ended up using. Feel free to critique (I'll probably learn from it):

require 'rubygems'
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(URI.open("http://blog.website.com/"))

doc.xpath('//h2/a[@rel = "bookmark"]').each_with_index do |header, i|
  puts i+1
  puts " Title: #{header.text}"
  puts "  Image 1: #{header.xpath('following::img[1]')[0]["src"]}"
  puts "  Image 2: #{header.xpath('following::img[2]')[0]["src"]}"
end
Steven Harlow
  • No, `following::img` will pick up an image that's past the next h2 and the [0]["src"] will cause errors if it doesn't exist. Also, use css when possible. – pguardiario Feb 22 '13 at 02:00
  • This code works on the webpage I'm using, whereas the code you provide leaves out a few of the images (though with some adjustments I'm sure it'd work). I'm sure it's because you don't have the full information. – Steven Harlow Feb 22 '13 at 04:05
  • The reason following::img[1] doesn't skip a h2 is because there are facebook like images as img[0], which I don't care about. The formatting is luckily consistent across the page. – Steven Harlow Feb 22 '13 at 04:06

I did something similar once (I wanted the exact same output, actually). This solution is pretty easy to follow:

Depending on how the DOM is structured, you could do something like:

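body = page.css('div.format_text')
headers = page.css('div#content_inner h2 a')

# Pair each post body with the header at the same position.
body.each_with_index do |post_body, index|
  # headers[index] is a Nokogiri node, so take its text before printing.
  puts "#{index + 1}. #{headers[index].text}"
  # Print absolute image URLs only; to_s guards against a missing src.
  post_body.css('p a img, div > img').each do |img|
    puts img['src'] if img['src'].to_s.start_with?('http')
  end
end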

So basically, you pair each header with the post body at the same index and print every image found in that body. The page I was parsing had the headers outside of the image divs, which is why I used two different variables to find them (body / headers). Also, I targeted two selectors when looking for images, as this is the way this particular DOM was structured.

This should give you a nice clean output like you wanted.

Hope this helps!

Laura M