I'm looking for a way to parse this kind of HTML in Java.

<tr class="cBackHeader backCat" ...>
   <th class="padding" ...>
       ...
       <a href="{{URL CATEGORY}}" class="cHeader">{{TITLE CATEGORY}}</a>
   </th>
</tr>
(<tr class="sujet..." ...>
   ...
   <td ... class="subjectCase3" ...>
        <a href="{{URL TOPIC}}" class="cCatTopic" title="{{ID TOPIC}}">{{TITLE TOPIC}}</a>
   </td>
   ...
</tr>)+

I would like to extract each variable between {{ }}, in document order. I've managed to get the first part with this pattern:

<th class=\"padding\".*?>.*?<a href=\"(.+?)\" class=\"cHeader\">(.+?)</a></th>

But I don't know how to handle the second part (there may be many td.subjectCase3 cells).

Edit: here is my solution with Jsoup, but it is less optimized than one using Pattern and Matcher.

 Document document = Jsoup.parse(response);
 Element tmp;
 // Select header rows and topic rows together so they come back in document order.
 Elements elements = document.select("tr.cBackHeader,tr.sujet");
 for (Element el : elements) {
   if (el.hasClass("cBackHeader")) {
     // Category row: keep the title of the header link.
     tmp = el.select("a.cHeader").first();
     result.add(new TopicItem(null, tmp.ownText()));
   } else if (el.hasClass("sujet")) {
     // Topic row: the cell class must match the markup above (subjectCase3).
     tmp = el.select("td.subjectCase3 a").first();
     result.add(new TopicItem(new Topic(tmp.attr("title"), tmp.attr("href"), tmp.ownText()), null));
   }
 }

What do you think?

BkSouX

1 Answer

I would use a regular expression:

(href="{{).+?[}]

For the topic link, this matches: href="{{URL TOPIC}

Then in Java I would use String.split(). Its argument is a regex, so the brace has to be escaped.

String string = "href=\"{{URL TOPIC}";
String[] parts = string.split("\\{");
String part1 = parts[0]; // href="
String part2 = parts[1]; // "" (empty string between the two braces)
String part3 = parts[2]; // URL TOPIC}

From there I would strip the trailing "}" (String.trim() takes no argument, so substring is used instead):

return part3.substring(0, part3.length() - 1);

It isn't pretty, but it gets results.
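For comparison, here is a rough sketch of how both link types could be pulled out in document order with a single Pattern and Matcher, which is what the question asks about. It assumes the attributes always appear in exactly the order shown in the question (href, class, then title) and that the link text contains no nested tags. The class name TopicRegexDemo, the extract method, the pipe-separated output format, and the sample URLs are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TopicRegexDemo {
    // One alternative per link type. Groups 1-2 belong to the category link,
    // groups 3-5 to the topic link. find() walks the input left to right,
    // so matches come back in document order.
    private static final Pattern LINKS = Pattern.compile(
        "<a href=\"([^\"]*)\" class=\"cHeader\">([^<]*)</a>"
        + "|<a href=\"([^\"]*)\" class=\"cCatTopic\" title=\"([^\"]*)\">([^<]*)</a>");

    // Returns one pipe-separated row per link (hypothetical output format).
    static List<String> extract(String html) {
        List<String> rows = new ArrayList<>();
        Matcher m = LINKS.matcher(html);
        while (m.find()) {
            if (m.group(1) != null) {
                // First alternative matched: URL and title of a category.
                rows.add("CATEGORY|" + m.group(1) + "|" + m.group(2));
            } else {
                // Second alternative matched: id, URL and title of a topic.
                rows.add("TOPIC|" + m.group(4) + "|" + m.group(3) + "|" + m.group(5));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample input following the structure in the question.
        String html =
            "<tr class=\"cBackHeader backCat\">\n"
            + "  <th class=\"padding\">\n"
            + "    <a href=\"/cat/1\" class=\"cHeader\">Category One</a>\n"
            + "  </th>\n"
            + "</tr>\n"
            + "<tr class=\"sujet\">\n"
            + "  <td class=\"subjectCase3\">\n"
            + "    <a href=\"/topic/10\" class=\"cCatTopic\" title=\"10\">First topic</a>\n"
            + "  </td>\n"
            + "</tr>\n";
        for (String row : extract(html)) {
            System.out.println(row);
        }
    }
}
```

Because the pattern only anchors on the two anchor tags, it does not care how many topic rows follow a category. It will silently miss links if the site ever reorders the attributes, which is where the Jsoup version from the question is more robust.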

James Ruiz
  • I'm sorry, but if you basically say that you don't really know how to do this right and you know you do it in an awful way, then maybe it's not the right topic to be giving advice on. Just my 2c. – Ingo Bürk Nov 13 '14 at 20:24