I need to scrape HTML that has the following form:

<div id='content'>
    <h3>Headline1</h3>
    <div>Text1</div>
    <div>Text2</div>
    <div>Text3</div>
    <h3>Headline2</h3>
    <div>Text4</div>
    <div>Text5</div>
    <h3>Headline3</h3>
    <div>Text6</div>
    <div>... and so on ...</div>
</div>

I need the content between consecutive headline tags as separate chunks, that is, from one headline up to the next. Unfortunately, there is no container tag around the desired ranges.

I tried the fragment selector {[:h3] [:h3]}, but it only returns the h3 tags themselves, without the nodes in between them:

(({:tag :h3, :attrs nil, :content ("Headline1")})
 ({:tag :h3, :attrs nil, :content ("Headline2")})
 ({:tag :h3, :attrs nil, :content ("Headline3")}))

What does work is {[[:h3 (html/nth-of-type 1)]] [[:h3 (html/nth-of-type 2)]]}. This gives me all of the HTML between the first and second h3 tags. However, it does not give me all of the desired chunks with a single selector.

Can Enlive do this at all, or should I resort to a regular expression?

Thanks!

Majnu
  • Does this answer help?: http://stackoverflow.com/questions/17157780/range-selectors-in-enlive PS you cannot use regular expression to parse HTML: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags If you try you may wake the ancient ones ;-) – Arthur Ulfeldt Sep 03 '14 at 20:56
  • Unfortunately it doesn't seem to help. When I select with {[:h3] [:h3]} I only get the h3-tags, but not the nodes in between them. – Majnu Sep 03 '14 at 21:14
  • If you add an html snippet and your selector we may be able to write one that matches. Bonus points if you include the output you are getting now. – Arthur Ulfeldt Sep 03 '14 at 21:26
  • Hi, I've edited the post with a snippet and a more detailed description. Thanks! – Majnu Sep 03 '14 at 21:47

1 Answer

Select all the children of div#content, then partition them based on their tag.

There is a more general concept here of separating a sequence of things into segments by identifying which things are separators and which are not:

(defn separate*
  "Produces a sequence of (parent child*)*, coll must start with a parent"
  [child? coll]
  (lazy-seq
   (when-let [s (seq coll)]
     (let [run (cons (first s)
                     (take-while child? (next s)))]
       (cons run (separate* child? (drop (count run) s)))))))

This is very similar to partition-by, but it always starts a new segment at each parent:

(partition-by keyword? [:foo 1 2 3 :bar :baz 4 5])
;; => ((:foo) (1 2 3) (:bar :baz) (4 5))

(separate* (complement keyword?) [:foo 1 2 3 :bar :baz 4 5])
;; => ((:foo 1 2 3) (:bar) (:baz 4 5))

To handle input that does not start with a title, give the first segment a nil parent:

(defn separate
  [parent? coll]
  (when-let [s (seq coll)]
    (if (parent? (first coll))
      (separate* (complement parent?) coll)
      (let [child? (complement parent?)
            run (take-while child? s)]
        (cons (cons nil run)
              (separate* child? (drop (count run) s)))))))

(separate keyword? [1 2 :foo 3 4])
;; => ((nil 1 2) (:foo 3 4))

And returning to the problem at hand:

(def x [{:tag :h3 :content "1"}
        {:tag :div :content "A"}
        {:tag :div :content "B"}
        {:tag :h3 :content "2"}
        {:tag :div :content "C"}
        {:tag :div :content "D"}])

(def sections (separate #(= :h3 (:tag %)) x))
;; => (({:content "1", :tag :h3}
;;      {:content "A", :tag :div}
;;      {:content "B", :tag :div})
;;     ({:content "2", :tag :h3}
;;      {:content "C", :tag :div}
;;      {:content "D", :tag :div}))

If we don't care to retain the content of the h3 titles:

(map rest sections)
;; => (({:content "A", :tag :div} {:content "B", :tag :div})
;;     ({:content "C", :tag :div} {:content "D", :tag :div}))
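
With real Enlive output, the children of the content div will usually also include whitespace text nodes (plain strings) between the elements, so it may help to drop those before grouping. The following is only a rough sketch under that assumption: it uses a reduce-based variant rather than the separate function above, the children vector is hand-written sample data in the shape Enlive produces, and headline? and chunk-by-headline are hypothetical helper names.

```clojure
;; Hypothetical sample input: children of the content div as Enlive-style nodes.
;; Element nodes are maps; text nodes (here, inter-tag whitespace) are strings.
(def children
  ["\n    "
   {:tag :h3, :attrs nil, :content ["Headline1"]}
   "\n    "
   {:tag :div, :attrs nil, :content ["Text1"]}
   {:tag :div, :attrs nil, :content ["Text2"]}
   "\n    "
   {:tag :h3, :attrs nil, :content ["Headline2"]}
   {:tag :div, :attrs nil, :content ["Text4"]}])

(defn headline?
  "True when node is an h3 element."
  [node]
  (= :h3 (:tag node)))

(defn chunk-by-headline
  "Groups element nodes into {:headline h3-node-or-nil, :body [nodes...]} maps.
  Nodes that appear before the first h3 go into a chunk with a nil headline."
  [nodes]
  (->> nodes
       (filter map?) ; keep element nodes, drop text nodes
       (reduce (fn [chunks node]
                 (cond
                   ;; each h3 opens a new chunk
                   (headline? node) (conj chunks {:headline node :body []})
                   ;; no h3 seen yet: open a chunk without a headline
                   (empty? chunks)  [{:headline nil :body [node]}]
                   ;; otherwise append to the body of the latest chunk
                   :else (update-in chunks [(dec (count chunks)) :body] conj node)))
               [])))

(def chunks (chunk-by-headline children))

(map #(mapcat :content (:body %)) chunks)
;; => (("Text1" "Text2") ("Text4"))
```

Keeping the headline and the body under separate keys avoids having to call rest on each chunk later, at the cost of no longer being a lazy sequence.
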
Timothy Pratley