I have a string like this (made from HTML source code):

<tr>
  <td>
    <tr>First</tr>
  </td>
</tr>
<tr>
  <td>Second</td>
</tr>
<tr>
  <td>
    <tr>
      <td>Upper</td>
    </tr>
    <tr>
      <td>Lower</td>
    </tr>
  </td>
</tr>

but on a single line; I split it across lines here for readability. What I want is a regular expression that captures each top-level row of this table as a whole, so that the matches are:

<td>
  <tr>First</tr>
</td>

,

<td>Second</td>

,

<td>
  <tr>
    <td>Upper</td>
  </tr>
  <tr>
    <td>Lower</td>
  </tr>
</td>

The simplest options:

  • <tr>.*</tr> - catches everything
  • <tr>.*?</tr> - catches from the first <tr> to the first </tr>.
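
To make the difference between the two patterns concrete, here is a minimal sketch using only `java.util.regex`. The class name `GreedyVsLazy`, the helper `findAll`, and the shortened one-line input are made up for illustration and are not part of the original post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsLazy {
    // Illustrative helper: returns every match of the regex in the input, in order.
    public static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Assumed sample: the first two rows of the table, on one line.
        String html = "<tr><td><tr>First</tr></td></tr><tr><td>Second</td></tr>";

        // Greedy: a single match that spans the whole string.
        System.out.println(findAll("<tr>.*</tr>", html));

        // Lazy: the first match stops at the inner </tr>, so the nested row
        // is cut short: "<tr><td><tr>First</tr>", then "<tr><td>Second</td></tr>".
        System.out.println(findAll("<tr>.*?</tr>", html));
    }
}
```

Neither quantifier keeps track of nesting depth, which is why the first row is either swallowed together with everything after it or truncated at the inner closing tag.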

I want it to match each opening <tr> with its corresponding closing </tr>. Can anybody help?

Andrew Thompson
karex
  • 5
    Use an HTML parser to parse HTML. And in future, please review the ***preview*** carefully before posting. – Andrew Thompson Jun 13 '13 at 12:30
  • 2
    Use something like [JSoup](http://jsoup.org) or you'll get [burned](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – fge Jun 13 '13 at 12:32
  • 2
    Relevant: http://stackoverflow.com/questions/238036/java-html-parsing – ohaal Jun 13 '13 at 12:32
  • use a counter...++ the tags and -- the tags, anytime the counter hits zero print – orangegoat Jun 13 '13 at 12:32
  • 2
    You should **not** use a regex for parsing HTML. This answer provides a fantastic explanation why you shouldn't do it: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – gkalpak Jun 13 '13 at 12:32
  • 1
    @ExpertSystem :) That is a classic. – Andrew Thompson Jun 13 '13 at 12:33
  • @AndrewThompson: I know :) It's the best answer ever !!! – gkalpak Jun 13 '13 at 12:34
  • 2
    @AndrewThompson I think SO should put this link in the face of the user if it sees `regex` and `html` as a tag combination ;) – fge Jun 13 '13 at 12:34
  • This is possible using regex using recursive pattern `(?R)` which Java doesn't support. Here's a [demo](http://regex101.com/r/tM3fO4) using PHP PCRE flavor, note that it doesn't work like expected but it's just a [poc](http://en.wikipedia.org/wiki/Proof_of_concept). So you're better off using an html parser. – HamZa Jun 13 '13 at 13:45
  • 2
    `First` is invalid html by the way – Balint Bako Jun 13 '13 at 14:24
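
The counter idea from orangegoat's comment can be sketched in plain Java as follows. This is only a rough illustration under strong assumptions: the class `TopLevelRows`, the method `topLevelRows`, and the tag pattern are hypothetical, the input is assumed to contain well-balanced `<tr>` tags, and HTML comments, scripts, and attribute values containing `>` are ignored. It is not a substitute for a real HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TopLevelRows {
    // Matches an opening <tr ...> or a closing </tr>; group 1 is "/" for a closing tag.
    private static final Pattern TR_TAG =
            Pattern.compile("<(/?)tr\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Returns the contents of every outermost <tr>...</tr> pair.
    public static List<String> topLevelRows(String html) {
        List<String> rows = new ArrayList<>();
        Matcher m = TR_TAG.matcher(html);
        int depth = 0;          // how many <tr> tags are currently open
        int contentStart = -1;  // index just after the outermost opening <tr>
        while (m.find()) {
            boolean closing = !m.group(1).isEmpty();
            if (!closing) {
                if (depth == 0) {
                    contentStart = m.end();
                }
                depth++;
            } else if (depth > 0) {
                depth--;
                if (depth == 0) {
                    // Counter is back at zero: one complete top-level row.
                    rows.add(html.substring(contentStart, m.start()));
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        String html = "<tr><td><tr>First</tr></td></tr>"
                + "<tr><td>Second</td></tr>"
                + "<tr><td><tr><td>Upper</td></tr><tr><td>Lower</td></tr></td></tr>";
        for (String row : topLevelRows(html)) {
            System.out.println(row);
        }
    }
}
```

The regex here only locates individual tags; the nesting is tracked by the counter, which is the part a single Java regular expression cannot do.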

1 Answer

You could use the HTML parsing library jsoup and run something like this to pull the rows out of your table:

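// Needs org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.select.Elements
// Jsoup.connect() requires an absolute http(s) URL; this one is a placeholder.
String url = "http://example.com/a.html";
Document doc = Jsoup.connect(url).get();

// Select every <tr> that sits inside a <table>.
Elements rows = doc.select("table tr");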
Ro Yo Mi