Your question isn't clear about what you are looking for.
First, the HTML is malformed because the <DT> tags are not closed correctly, and there is an illegal character in the first a tag's text that Ruby 1.9.2 doesn't like because it's not UTF-8. I converted the character to an entity in TextMate.
html = %{
<DT>
<A HREF="http://mezzoblue.com/archives/2009/01/27/sprite_optim/" ADD_DATE="1233132422" PRIVATE="0" TAGS="irw_20">mezzoblue § Sprite Optimization</A>
<DT>
<A HREF="http://datamining.typepad.com/data_mining/2008/11/minority-report-interface.html" ADD_DATE="1226827542" PRIVATE="0" TAGS="irw_20">Minority Report Interface</A>
<DT>
<A HREF="http://www.windowshop.com/" ADD_DATE="1225267658" PRIVATE="0" TAGS="irw_20">Amazon Windowshop Beta</A>
<DD>Window shopping from Amazon
}
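The non-UTF-8 character can also be handled in Ruby instead of an editor. This is only a sketch: it assumes the export was saved as ISO-8859-1 (the real source encoding isn't known), and the sample string and variable names are made up for illustration.

```ruby
# Assumption: the bookmark export was saved as ISO-8859-1, where the
# section sign is the single byte 0xA7. The sample string is illustrative.
raw = "mezzoblue \xA7 Sprite Optimization".b

# Read as UTF-8, a lone 0xA7 byte is invalid. This is the kind of input
# that Ruby 1.9 and later refuse to treat as valid UTF-8.
looks_valid = raw.dup.force_encoding('UTF-8').valid_encoding?

# Relabel the bytes with their assumed real encoding, then transcode to UTF-8.
fixed = raw.dup.force_encoding('ISO-8859-1').encode('UTF-8')

# Optionally swap the character (U+00A7) for an HTML entity.
as_entity = fixed.gsub("\u00A7", '&sect;')

puts looks_valid
puts as_entity
```

If the file actually came from a different encoding such as Windows-1252, swap that name into force_encoding.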
That HTML parses to this in Nokogiri after it tries to fix it up:
(rdb:1) print doc.to_html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<dt>
<a href="http://mezzoblue.com/archives/2009/01/27/sprite_optim/" add_date="1233132422" private="0" tags="irw_20">mezzoblue § Sprite Optimization</a>
<dt>
<a href="http://datamining.typepad.com/data_mining/2008/11/minority-report-interface.html" add_date="1226827542" private="0" tags="irw_20">Minority Report Interface</a>
<dt>
<a href="http://www.windowshop.com/" add_date="1225267658" private="0" tags="irw_20">Amazon Windowshop Beta</a>
</dt>
</dt>
</dt>
<dd>Window shopping from Amazon
</dd>
</body></html>
Notice how the closing dt tags are grouped just before the only dd tag? That's icky, but OK because it doesn't change how we have to look for the dd content.
require 'nokogiri'

doc = Nokogiri::HTML(html, nil, 'UTF-8')
comments = []
doc.css('dt + dd').each do |dd|
  comments << dd.text
end
puts comments
# >> Window shopping from Amazon
That means: find a <dd> that immediately follows a <dt> as its next sibling. You don't/can't look for dt followed by a followed by dd because that's not how the HTML parses: each a is a child of its dt, not a sibling of the dd. It's really dt followed by dd, which is what "dt + dd" means.
The other way it seemed like your question could read was that you were looking for the content of the a tags:
comments = []
doc.css('a').each do |a|
  comments << a.text
end
puts comments
# >> mezzoblue § Sprite Optimization
# >> Minority Report Interface
# >> Amazon Windowshop Beta