I need to scrape HTML that has the following form:
<div id='content'>
<h3>Headline1</h3>
<div>Text1</div>
<div>Text2</div>
<div>Text3</div>
<h3>Headline2</h3>
<div>Text4</div>
<div>Text5</div>
<h3>Headline3</h3>
<div>Text6</div>
<div>... and so on ...</div>
</div>
I need to get the content between the headline tags as separate chunks, that is, from one headline up to the next. Unfortunately, there is no container tag for the desired ranges.
I tried the fragment selector {[:h3] [:h3]}, but somehow this only returns all h3 tags, without the tags in between them:
(({:tag :h3, :attrs nil, :content ("Headline1")}) ({:tag :h3, :attrs nil, :content ("Headline2")}) ({:tag :h3, :attrs nil, :content ("Headline3")}))
What does work is {[[:h3 (html/nth-of-type 1)]] [[:h3 (html/nth-of-type 2)]]}. This gives me all of the HTML between the first and second h3 tags. However, it does not give me all of the desired chunks with one selector.
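To make the desired result concrete, here is a rough sketch of the grouping in plain Clojure, without any selector. It is only an illustration under assumptions: split-at-headlines is a made-up helper name (not an enlive function), and the nodes are hand-written maps in the same {:tag :attrs :content} shape as the output above, not something actually parsed from the page.

```clojure
;; Hypothetical helper (not part of enlive): start a new chunk at every :h3
;; and append each following sibling to the most recent chunk.
(defn split-at-headlines [nodes]
  (reduce (fn [chunks node]
            (cond
              (= :h3 (:tag node)) (conj chunks [node])
              (empty? chunks)     chunks ; skip nodes before the first headline
              :else               (conj (pop chunks) (conj (peek chunks) node))))
          []
          nodes))

;; Hand-written sample nodes standing in for the children of div#content.
(def sample-nodes
  [{:tag :h3  :attrs nil :content ["Headline1"]}
   {:tag :div :attrs nil :content ["Text1"]}
   {:tag :div :attrs nil :content ["Text2"]}
   {:tag :h3  :attrs nil :content ["Headline2"]}
   {:tag :div :attrs nil :content ["Text4"]}])

;; One vector per headline, each starting with its :h3 node.
(def chunks (split-at-headlines sample-nodes))
```

With real parsed input, the children of the container may also include whitespace strings between the elements; this sketch would simply keep them inside the chunks.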
Can Enlive do this at all, or should I resort to a regular expression?
Thanks!