I need to get all tags between the last </td> and the closing </tr> in each row. The regular expression I use, <\/TD\s*>(.*?)<\/TR\s*>, matches everything from the first </TD> to the closing </TR>, marked in bold in the sample below.

<TABLE>
 <TR><TD>TD11**</TD><TD>TD12</TD><TD>TD13</TD><SPAN><FONT>test1</FONT></SPAN></TR>**
 <TR><TD>TD21**</TD><TD>TD22</TD><TD>TD23</TD><SPAN><FONT>test2</FONT></SPAN></TR>**
</TABLE>

But what I really need is:

<TABLE>
 <TR><TD>TD11</TD><TD>TD12</TD><TD>TD13**</TD><SPAN><FONT>test1</FONT></SPAN></TR>**
 <TR><TD>TD21</TD><TD>TD22</TD><TD>TD23**</TD><SPAN><FONT>test2</FONT></SPAN></TR>**
</TABLE>
Alan Moore
PARUS
  • Could you clarify what you want from that table? What does "between last in each row" mean? – Nobita Oct 05 '11 at 20:19
  • Psst... http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – driis Oct 05 '11 at 20:21
  • If you want an easy solution that is maintainable, you don't want to use a regex for this. If this is just a personal programming exercise because, say, climbing Mt. Everest while naked and tripping balls is just not AWESOME enough for you, well then, try to use a regex. But, really, you don't want to use a regex for this. – Michael Paulukonis Oct 05 '11 at 20:41

2 Answers

It's not recommended to use regular expressions to parse HTML. HTML is not a regular language, so parsing it with regular expressions is notoriously unreliable.

Here's a good blog post explaining the logic and offering alternatives: http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html

Kelend
  • Unfortunately, that doesn't solve my problem. I absolutely agree it's a bad idea, but this is the HTML that MS Word returns during copy-paste, and I need to remove the tags in between to make the view look correct. And the customers don't use jQuery. So this is a hard fix that will be replaced with a better solution in the future. But right now I need to replace this in many places, and a regex is a good way to do it for now. – PARUS Oct 05 '11 at 20:48
</TD>((?:(?!</T[DR]>).)*)</TR>

The regex starts matching at the first </TD>, then consumes characters one at a time with (?!</T[DR]>)., which matches any character that is not the first character of a </TD> or </TR> tag. When it reaches the second </TD>, the lookahead fails and the enclosing (?:...)* loop stops. Because that loop is optional, the regex moves on and tries to match the next part, which is </TR>. The input at that point is </TD>, so that fails too, and the match attempt is abandoned.

It tries again starting at the second </TD> and fails again. Finally, it starts matching at the third </TD> and successfully matches from there to the first </TR>.
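The walkthrough above can be checked with a short script. This is only a sketch, assuming Python's re module as the regex flavor (the question did not name one); the names ROW, LAZY and TEMPERED are made up for illustration. It runs the original lazy pattern and the lookahead-based pattern against the first sample row.

```python
import re

# First sample row from the question (bold markers removed).
ROW = "<TR><TD>TD11</TD><TD>TD12</TD><TD>TD13</TD><SPAN><FONT>test1</FONT></SPAN></TR>"

# Pattern from the question: lazy, but anchored at the FIRST </TD> it finds.
LAZY = re.compile(r"</TD\s*>(.*?)</TR\s*>")

# Pattern from this answer: the lookahead stops the group at any </TD> or </TR>,
# so only an attempt starting at the LAST </TD> can reach </TR>.
TEMPERED = re.compile(r"</TD>((?:(?!</T[DR]>).)*)</TR>")

lazy_match = LAZY.search(ROW)
tempered_match = TEMPERED.search(ROW)

# Group 1 of the lazy pattern still contains the second and third cells.
print(lazy_match.group(1))

# Group 1 of the lookahead pattern contains only what follows the last </TD>.
print(tempered_match.group(1))
```

The lazy quantifier only limits how far the match extends to the right; it does not move the starting point, which is why the first pattern captures too much.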

You may want to specify "single-line" or "dot-matches-all" mode, in case there are newlines that didn't show in your example. You didn't specify a regex flavor, so I can't say exactly how to do that.
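As an illustration of the dot-matches-all point, here is a minimal sketch that again assumes Python's re module, where the mode is enabled with the re.DOTALL flag. Other flavors spell it differently. The multi-line HTML string is a hypothetical variant of the sample, with newlines added between the last cell and the closing tag.

```python
import re

# Hypothetical variant of the sample row with newlines before </TR>.
HTML = """<TABLE>
 <TR><TD>TD11</TD><TD>TD12</TD><TD>TD13</TD>
  <SPAN><FONT>test1</FONT></SPAN>
 </TR>
</TABLE>"""

PATTERN = r"</TD>((?:(?!</T[DR]>).)*)</TR>"

# Without DOTALL the dot cannot cross the newline after the last </TD>,
# so no match is found at all.
without_flag = re.search(PATTERN, HTML)

# With DOTALL the dot also matches newlines and the row tail is captured.
with_flag = re.search(PATTERN, HTML, re.DOTALL)

print(without_flag)
print(with_flag.group(1).strip())
```

If the real input mixes upper- and lower-case tags, re.IGNORECASE can be combined with re.DOTALL using the | operator.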

Alan Moore