
I am trying to scrape some data from a page with a table-based layout. To reach some of the data I need something like the 3rd table inside the 2nd table inside the 5th table inside the 1st table inside the body. I am trying to use Enlive, but cannot figure out how to use nth-of-type and the other selector steps. To make matters worse, the page in question has a single top-level table inside the body, yet (select data [:body :> :table]) returns 6 results for some reason. What the hell am I doing wrong?

Mad Wombat

1 Answer


For nth-of-type, does the following example help?

user> (require '[net.cgrand.enlive-html :as html])
user> (def test-html
           "<html><head></head><body><p>first</p><p>second</p><p>third</p></body></html>")
#'user/test-html
user> (html/select (html/html-resource (java.io.StringReader. test-html))
                   [[:p (html/nth-of-type 2)]])
({:tag :p, :attrs nil, :content ["second"]})

No idea about the second issue. Your approach seems to work with a naive test:

user> (def test-html "<html><head></head><body><div><p>in div</p></div><p>not in div</p></body></html>")
#'user/test-html
user> (html/select (html/html-resource (java.io.StringReader. test-html)) [:body :> :p])
({:tag :p, :attrs nil, :content ["not in div"]})

Any chance of looking at your actual HTML?

Update: (in response to the comment)

Here's another example where "the second <p> inside the <div> inside the second <div> inside whatever" is returned:

user> (def test-html "<html><head></head><body><div><p>this is not the one</p><p>nor this</p><div><p>or for that matter this</p><p>skip this one too</p></div></div><span><p>definitely not this one</p></span><div><p>not this one</p><p>not this one either</p><div><p>not this one, but almost</p><p>this one</p></div></div><p>certainly not this one</p></body></html>")
#'user/test-html
user> (html/select (html/html-resource (java.io.StringReader. test-html))
                   [[:div (html/nth-of-type 2)] :> :div :> [:p (html/nth-of-type 2)]])
({:tag :p, :attrs nil, :content ["this one"]})
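
On combining nth-of-type with other steps: as far as I can tell, what matters is the inner vector. A nested vector is an intersection (one element that satisfies every step in it), while steps written flat in the outer vector form a descendant chain. Here is a minimal sketch of the difference, assuming the same enlive-html setup as above; the name nodes and the sample markup are made up for illustration.

```clojure
(require '[net.cgrand.enlive-html :as html])

;; Hypothetical sample document: two sibling divs, each holding two paragraphs.
(def nodes
  (html/html-resource
    (java.io.StringReader.
      "<html><head></head><body><div><p>a</p><p>b</p></div><div><p>c</p><p>d</p></div></body></html>")))

;; Inner vectors are intersections: a div that is the 2nd div among its
;; siblings, then a direct child p that is the 2nd p among its siblings.
(map html/text
     (html/select nodes [[:div (html/nth-of-type 2)] :> [:p (html/nth-of-type 2)]]))
;; => ("d")

;; Written flat, the steps chain instead: any element inside a div
;; that is the 2nd of its type among its siblings.
(map html/text
     (html/select nodes [:div (html/nth-of-type 2)]))
;; => ("b" "d")
```

So for tables the shape would presumably be [[:table (html/nth-of-type 2)] :> [:table (html/nth-of-type 2)]], with two caveats: nth-of-type counts siblings under the same parent, and :> only matches direct children, so tables nested inside cells may need a plain descendant step rather than :>.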
Michał Marczyk
  • Seems like the second problem might be due to bad HTML. Can I combine nth-of-type with other selectors? If I need to find the second table inside the second table, can I do something like [:table (nth-of-type 2) :> :table (nth-of-type 2)]? – Mad Wombat Apr 23 '10 at 08:44
  • Ah! [] are intersections! The enlightenment is near! – Mad Wombat Apr 23 '10 at 20:47