I need to parse out the image URL from HTML much like the following:
<p><a href="http://blog.website.com/wp-content/uploads/2012/02/image_name.jpg" ><img class="aligncenter size-full wp-image-12313" alt="Example image Name" src="http://blog.website.com/wp-content/uploads/2012/02/image_name.jpg" width="630" height="119" /></a></p>
So far I am using Nokogiri to parse out <h2>
tags with:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
page = Nokogiri::HTML(open("http://blog.website.com/"))
headers = page.css('h2')
puts headers.text
I have two questions:
- How can I parse out the image url?
- Ideally I'd print to the console in this format:
1. Header 1 image_url 1 image_url 2 (if any) 2. Header 2 2image_url 1 2image_url 2 (if any)
And so far I haven't been able to print my headers in this nice format. How can I do so?
<h2><a href="http://blog.website.com/2013/02/15/images/" rel="bookmark" title="Permanent Link to Blog Post">Blog Post</a></h2>
<p class="post_author"><em>by</em> author</p>
<div class="format_text">
<p style="text-align: left;">Blog Content </p>
<p style="text-align: left;"> Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. </p>
<p style="text-align: center;"><a href="http://blog.website.com/wp-content/uploads/2012/02/image21.jpg" ><img class="alignnone size-full wp-image-23382" alt="image2" src="http://blog.website.com/wp-content/uploads/2012/02/image21.jpg" width="630" height="210" /></a></p>
<p style="text-align: left;">Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. </p>
<p style="text-align: center;"><b id="internal-source-marker_0.054238131968304515">Items: <a href="http://www.website.com/threads?src=login#/show/thread/A_abvaf812e3" target="_blank">Items for Spring</a></b></p>
<p style="text-align: center;">Lorem Ipsum.</p>
<p style="text-align: center;"><b id="internal-source-marker_0.054238131968304515">More Items: <a href="http://www.website.com/threads#/show/thread/A_abv2a6822e2" target="_blank">Lorem Ipsum</a></b></p>
<p style="text-align: center;">Lorem Ipsum.</p>
<p style="text-align: center;"><b id="internal-source-marker_0.054238131968304515">Still more items: <a href="http://www.website.com/threads#/show/thread/A_abv7af882e3" target="_blank">Items:</a></b></p>
<p style="text-align: center;">Lorem Ipsum.</p>
<p style="text-align: center;"><b id="internal-source-marker_0.054238131968304515">Lorem ipsum: <a href="http://www.website.com/threads?src=login#/show/thread/A_abvea6832e8" target="_blank">Items</a></b></p>
<p style="text-align: center;">Lorem Ipusm</p>
<p style="text-align: center;"><b id="internal-source-marker_0.054238131968304515">
</div>
<p class="to_comments"><span class="date">February 15, 2013</span> <span class="num_comments"><a href="http://blog.website.com/2013/02/15/Blog-post/#respond" title="Comment on Blog Post">No Comments</a></span></p>
and associate it with the header title.
– Steven Harlow Feb 20 '13 at 00:40