
I want to parse an HTML file containing links exported from Delicious. I am using Nokogiri for the parsing. The file has the following structure:

<DT>
   <A HREF="http://mezzoblue.com/archives/2009/01/27/sprite_optim/"
      ADD_DATE="1233132422"
      PRIVATE="0"
      TAGS="irw_20">mezzoblue § Sprite Optimization</A>
<DT>
   <A HREF="http://datamining.typepad.com/data_mining/2008/11/minority-report-interface.html" 
      ADD_DATE="1226827542" 
      PRIVATE="0" 
      TAGS="irw_20">Minority Report Interface</A>
<DT>
   <A HREF="http://www.windowshop.com/" 
      ADD_DATE="1225267658" 
      PRIVATE="0" 
      TAGS="irw_20">Amazon Windowshop Beta</A>
<DD>Window shopping from Amazon

As you can see, the link information is in the DT tag, and some links have a comment in a DD tag.

I do the following to get the link information:

doc.xpath('//dt//a').each do |node|
  title = node.text
  url = node['href']
  tags = node['tags']
  puts "#{title}, #{url}, #{tags}"
end

My question is how do I get the link information AND the comment when a dd tag is present?

magnushjelm

3 Answers


My question is how do I get the link information AND the comment when a dd tag is present?

Use:

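//dt/a | //dt[a]/following-sibling::*[1][self::dd]

This selects all a elements that have a dt parent, plus every dd element that is the immediately following sibling element of a dt element that has an a child. The names are lowercase because XPath is case-sensitive and Nokogiri's HTML parser lowercases element names.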

Note: Avoid the // abbreviation where you can. It makes the XPath engine search the whole document, which is often inefficient, and it can match elements at depths you did not intend.

Whenever the structure of the document is known, write out the explicit path instead of using //.

Dimitre Novatchev

Your question isn't clear about what you are looking for.

First, the HTML is malformed because the <DT> tags are not closed correctly, and there is an illegal character in the first a tag's text that Ruby 1.9.2 doesn't like because it's not UTF-8. I converted the character to an entity in TextMate.

html = %{
<DT>
  <A HREF="http://mezzoblue.com/archives/2009/01/27/sprite_optim/" ADD_DATE="1233132422" PRIVATE="0" TAGS="irw_20">mezzoblue &sect; Sprite Optimization</A>
<DT>
  <A HREF="http://datamining.typepad.com/data_mining/2008/11/minority-report-interface.html" ADD_DATE="1226827542" PRIVATE="0" TAGS="irw_20">Minority Report Interface</A>
<DT>
  <A HREF="http://www.windowshop.com/" ADD_DATE="1225267658" PRIVATE="0" TAGS="irw_20">Amazon Windowshop Beta</A>
<DD>Window shopping from Amazon
}

That HTML parses to this in Nokogiri after it tries to fix it up:

(rdb:1) print doc.to_html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<dt>
  <a href="http://mezzoblue.com/archives/2009/01/27/sprite_optim/" add_date="1233132422" private="0" tags="irw_20">mezzoblue § Sprite Optimization</a>
<dt>
  <a href="http://datamining.typepad.com/data_mining/2008/11/minority-report-interface.html" add_date="1226827542" private="0" tags="irw_20">Minority Report Interface</a>
<dt>
  <a href="http://www.windowshop.com/" add_date="1225267658" private="0" tags="irw_20">Amazon Windowshop Beta</a>
</dt>
</dt>
</dt>
<dd>Window shopping from Amazon
</dd>
</body></html>

Notice how the closing dt tags are grouped just before the only dd tag? That's icky, but ok because it doesn't change how we have to look for the dd content.

require 'nokogiri'

doc = Nokogiri::HTML(html, nil, 'UTF-8')

# Collect the text of every <dd> that immediately follows a <dt>.
comments = []
doc.css('dt + dd').each do |dd|
  comments << dd.text
end
puts comments

# >> Window shopping from Amazon

The selector "dt + dd" matches a <dd> that is the immediately following sibling of a <dt>. You can't look for dt followed by a followed by dd, because that's not how the HTML parses: the a is a child of the dt, so the dd follows the dt, not the a.

The other way it seemed like your question could read was that you were looking for the content of the a tags:

# Collect the text of every link.
titles = []
doc.css('a').each do |a|
  titles << a.text
end
puts titles

# >> mezzoblue § Sprite Optimization
# >> Minority Report Interface
# >> Amazon Windowshop Beta

the Tin Man
  • I have clarified my question. I would like the output to be the content of the a-tag directly followed by the content of the dd-tag when the dd-tag is present. – magnushjelm Dec 18 '10 at 11:51

I'm assuming that

<DD>Window shopping from Amazon

has a closing </DD> tag; I can't tell from just your snippet of the page. If so, you could do:

comment = node.parent.next_sibling.next_sibling.text rescue nil

You need to call next_sibling twice because the first call returns the text node holding the newline (or other whitespace) after the DT. You could remove all the newlines before parsing the page to avoid the double call. That might also be a good idea in case more than one newline character follows the DT tag.

Jack Chu
  • I just realized the DT tags aren't closed, I just assumed it was at the end of the line and I didn't scroll. In that case, next_sibling might not work as expected. – Jack Chu Dec 18 '10 at 09:53