I'm looking for a way to parse this kind of HTML in Java.
<tr class="cBackHeader backCat" ...>
<th class="padding" ...>
...
<a href="{{URL CATEGORY}}" class="cHeader">{{TITLE CATEGORY}}</a>
</th>
</tr>
(<tr class="sujet..." ...>
...
<td ... class="subjectCase3" ...>
<a href="{{URL TOPIC}}" class="cCatTopic" title="{{ID TOPIC}}">{{TITLE TOPIC}}</a>
</td>
...
</tr>)+
I would like to extract each variable between {{ }}, in document order. I've managed to get the first part with this pattern:
<th class=\"padding\".*?>.*?<a href=\"(.+?)\" class=\"cHeader\">(.+?)</a></th>
But I don't know how to handle the second part (there may be many td.subjectCase3 cells).
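For illustration, one way to cover both parts with Pattern and Matcher is a single pattern with two alternatives, walked with find() so matches come back in document order. This is only a sketch under assumptions: the attribute order and quoting must match the sample exactly, the th.padding prefix from my pattern is dropped so only the link classes are checked, and the class name ForumRegexSketch, the group names, and the output format are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ForumRegexSketch {
    // Two alternatives: a category link (class cHeader) or a topic link (class cCatTopic).
    // Assumes href, class and title appear in exactly this order, as in the sample markup.
    private static final Pattern LINK = Pattern.compile(
        "<a href=\"(?<catUrl>[^\"]*)\" class=\"cHeader\">(?<catTitle>[^<]*)</a>"
        + "|<a href=\"(?<topicUrl>[^\"]*)\" class=\"cCatTopic\" title=\"(?<topicId>[^\"]*)\">(?<topicTitle>[^<]*)</a>");

    /** Returns one line per category or topic, in the order they appear in the HTML. */
    static List<String> extract(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            // A group that did not take part in the match is null,
            // which tells us which alternative matched.
            if (m.group("catUrl") != null) {
                out.add("CATEGORY " + m.group("catUrl") + " " + m.group("catTitle"));
            } else {
                out.add("TOPIC " + m.group("topicId") + " " + m.group("topicUrl") + " " + m.group("topicTitle"));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical input shaped like the sample above.
        String html = "<tr class=\"cBackHeader backCat\"><th class=\"padding\">"
            + "<a href=\"/cat/1\" class=\"cHeader\">News</a></th></tr>"
            + "<tr class=\"sujet\"><td class=\"subjectCase3\">"
            + "<a href=\"/topic/10\" class=\"cCatTopic\" title=\"10\">Hello</a></td></tr>";
        for (String line : extract(html)) {
            System.out.println(line);
        }
    }
}
```

Each category line is followed by the topic lines that come after it in the page, so grouping topics under their category is a matter of remembering the last category seen.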
Edit: here is my solution with Jsoup, but it is less optimized than one using Pattern and Matcher.
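import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

Document document = Jsoup.parse(response);
// Select category rows and topic rows in a single query.
Elements elements = document.select("tr.cBackHeader,tr.sujet");
for (Element el : elements) {
    if (el.hasClass("cBackHeader")) {
        Element link = el.select("a.cHeader").first();
        if (link != null) { // first() returns null when nothing matches
            result.add(new TopicItem(null, link.ownText()));
        }
    } else if (el.hasClass("sujet")) {
        // The cell class follows the markup above (subjectCase3).
        Element link = el.select("td.subjectCase3 a").first();
        if (link != null) {
            result.add(new TopicItem(new Topic(link.attr("title"), link.attr("href"), link.ownText()), null));
        }
    }
}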
What do you think?