Best way to parse a file with links exported from Delicious.com using Nokogiri?

Question

I want to parse an HTML file containing links exported from Delicious, and I am using Nokogiri for the parsing. The file has the following structure:

<DT>
   <A HREF="http://mezzoblue.com/archives/2009/01/27/sprite_optim/"
      ADD_DATE="1233132422"
      PRIVATE="0"
      TAGS="irw_20">mezzoblue § Sprite Optimization</A>
<DT>
   <A HREF="http://datamining.typepad.com/data_mining/2008/11/minority-report-interface.html" 
      ADD_DATE="1226827542" 
      PRIVATE="0" 
      TAGS="irw_20">Minority Report Interface</A>
<DT>
   <A HREF="http://www.windowshop.com/" 
      ADD_DATE="1225267658" 
      PRIVATE="0" 
      TAGS="irw_20">Amazon Windowshop Beta</A>
<DD>Window shopping from Amazon

As you can see, the link information is in the DT tag, and some links have a comment in a DD tag.

I do the following to get the link information:

doc.xpath('//dt//a').each do |node|
  title = node.text
  url = node['href']
  tags = node['tags']
  puts "#{title}, #{url}, #{tags}"
end

My question is how do I get the link information AND the comment when a dd tag is present?

You want to get the text for the `<dt>` tags followed by `<dd>` tags? Please give a sample of the desired output. — the Tin Man, Dec 18 '10 at 08:54
It might be more convenient to export the data using: `curl -o output.xml --user yourusername https://api.del.icio.us/v1/posts/all` and parse the xml-file. — jfs, Dec 18 '10 at 09:52
Or use typhoeus to retrieve it and pass the body directly to Nokogiri. — the Tin Man, Dec 18 '10 at 10:55
Good question, +1. See my answer for an XPath expression that selects exactly the wanted elements. :) — Dimitre Novatchev, Dec 18 '10 at 17:54
If you can, try to use the Delicious data through the API; it should be simpler than scraping HTML. There are Ruby tools for that. https://github.com/weppos/www-delicious — fifigyuri, Dec 18 '10 at 17:57

score 3 · Accepted Answer · answered Dec 18 '10 at 17:53

My question is how do I get the link information AND the comment when a dd tag is present?

Use:

//DT/a | //DT[a]/following-sibling::*[1][self::DD]

This selects all a elements that have a DT parent and all DD elements that are the immediate following sibling element of a DT element that has an a child.

Note: Using the // abbreviation is strongly discouraged, because it usually makes evaluation inefficient (the whole document may have to be searched) and can produce results that surprise developers.

Whenever the structure of the XML document is known, avoid using the // abbreviation.

Sweet! I had to make DT's and the DD's lowercase, then it worked like a charm. — magnushjelm, Dec 18 '10 at 18:46

the Tin Man · Answer 2 · 2010-12-18T09:31:30.390

Your question isn't clear about what you are looking for.

First, the HTML is malformed because the <DT> tags are not closed correctly, and there is an illegal character in the first a tag's text that Ruby 1.9.2 doesn't like because it's not UTF-8. I converted the character to an entity in TextMate.

html = %{
<DT>
  <A HREF="http://mezzoblue.com/archives/2009/01/27/sprite_optim/" ADD_DATE="1233132422" PRIVATE="0" TAGS="irw_20">mezzoblue &sect; Sprite Optimization</A>
<DT>
  <A HREF="http://datamining.typepad.com/data_mining/2008/11/minority-report-interface.html" ADD_DATE="1226827542" PRIVATE="0" TAGS="irw_20">Minority Report Interface</A>
<DT>
  <A HREF="http://www.windowshop.com/" ADD_DATE="1225267658" PRIVATE="0" TAGS="irw_20">Amazon Windowshop Beta</A>
<DD>Window shopping from Amazon
}

That HTML parses to this in Nokogiri after it tries to fix it up:

(rdb:1) print doc.to_html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<dt>
  <a href="http://mezzoblue.com/archives/2009/01/27/sprite_optim/" add_date="1233132422" private="0" tags="irw_20">mezzoblue § Sprite Optimization</a>
<dt>
  <a href="http://datamining.typepad.com/data_mining/2008/11/minority-report-interface.html" add_date="1226827542" private="0" tags="irw_20">Minority Report Interface</a>
<dt>
  <a href="http://www.windowshop.com/" add_date="1225267658" private="0" tags="irw_20">Amazon Windowshop Beta</a>
</dt>
</dt>
</dt>
<dd>Window shopping from Amazon
</dd>
</body></html>

Notice how the closing dt tags are grouped just before the only dd tag? That's icky, but ok because it doesn't change how we have to look for the dd content.

require 'nokogiri'

doc = Nokogiri::HTML(html, nil, 'UTF-8')

# Collect the text of every <dd> that immediately follows a <dt>.
comments = []
doc.css('dt + dd').each do |dd|
  comments << dd.text
end
puts comments

# >> Window shopping from Amazon

The selector "dt + dd" means: find a <dd> that immediately follows a <dt> sibling. You can't look for dt followed by a followed by dd, because that's not how the HTML parses: each a is a child of its dt, so the dd is a sibling of the dt, not of the a.

The other way it seemed like your question could read was that you were looking for the content of the a tags:

comments = []
doc.css('a').each do |a|
  comments << a.text
end
puts comments

# >> mezzoblue § Sprite Optimization
# >> Minority Report Interface
# >> Amazon Windowshop Beta

I have clarified my question. I would like the output to be the content of the a-tag directly followed by the content of the dd-tag when the dd-tag is present. — magnushjelm, Dec 18 '10 at 11:51

Jack Chu · Answer 3 · 2010-12-18T09:55:53.557

I'm assuming the:

<DD>Window shopping from Amazon

has a closing </DD> tag; I can't tell from just your snippet of the page. If so, you could do:

comment = node.parent.next_sibling.next_sibling.text rescue nil

You need to call next_sibling twice because the first call returns the text node holding the \n (newline) or other whitespace between the tags. You could remove all the newlines before parsing the page to avoid the double call. That might also be a good idea in case there's more than one newline character after the DT tag.

edited Dec 18 '10 at 09:55

answered Dec 18 '10 at 09:41

I just realized the DT tags aren't closed, I just assumed it was at the end of the line and I didn't scroll. In that case, next_sibling might not work as expected. – Jack Chu Dec 18 '10 at 09:53
