
I have an HTML page which has only one <table> tag but many <tr> and <td> tags.

Example:

<tr attributes >
    <td>Name1</td>
    <td>some text</td>
    <td>some text</td>
</tr>                                                            1.
<tr>
    <td>some text</td>
    <td>--------</td>
    <td>some text</td>
    <td>some text</td>
</tr>
<tr>
    <td>Total</td>
    <td>--------</td>
    <td>1989</td>
    <td>some text</td>
</tr>
------------------------------------------------------------------------------
<tr attributes >
    <td>Name2</td>
    <td>some text</td>
    <td>some text</td>
</tr>
<tr>
    <td>some text</td>
    <td>--------</td>
    <td>some text</td>
    <td>some text</td>                                            
</tr>
<tr>
    <td>some text</td>
    <td>--------</td>
    <td>some text</td>
    <td>some text</td>
</tr>
<tr>
    <td>Total</td>
    <td>--------</td>
    <td>1979</td>
    <td>some text</td>
</tr>
------------------------------------------------------------------------------
<tr attributes >
    <td>Name3</td>
    <td>some text</td>
    <td>some text</td>
</tr>                                                                  2.
<tr>
    <td>some text</td>
    <td>--------</td>
    <td>some text</td>
    <td>some text</td>
</tr>
<tr>
    <td>Total</td>
    <td>--------</td>
    <td>1089</td>
    <td>some text</td>
</tr>

Now suppose I want the rows between Name1 and the following Total, and the rows between Name3 and the following Total.

There can be any number of rows and columns in between; the number of rows and columns is not fixed.

So the output should include the blocks marked 1. and 2. above.

JDB
Proneet
  • http://htmlagilitypack.codeplex.com/ – I4V May 29 '13 at 13:06
  • I don't want to use third-party tools. – Proneet May 29 '13 at 13:09
  • Then read this http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – I4V May 29 '13 at 13:10
  • @I4V - Maybe you should read this: http://stackoverflow.com/a/4234491/211627 – JDB May 29 '13 at 13:12
  • Aside from @I4V's point that you shouldn't use regex for this, it's not clear *exactly* what you want the output to be. Can you edit your question to show *exactly* what you want to extract? – Matt Burland May 29 '13 at 13:12
  • @Cyborgx37 I am eager to see your answer with regex. – I4V May 29 '13 at 13:13
  • What are you trying to achieve? There might be better ways to extract the information than using regular expressions. –  May 29 '13 at 13:18
  • @I4V - I would choose an HTML parser, because I like to spend time with my family. That said, the "parroted" "fact" that you cannot do this is simply wrong (especially given that .NET regexes [incorporate an NFA engine](http://msdn.microsoft.com/en-us/library/e347654k.aspx)). – JDB May 29 '13 at 13:18
  • You really need to revise your original post to make it clearer what you are asking for assistance with. At the moment your problem is not clear. – muttonlamb May 29 '13 at 13:20

2 Answers


If you want capture groups that separate the text from the HTML, use this one:

<td>Name(1|3)</td>((\s*<td>([^<]+)</td>\s*)+</tr>(.*?)<tr>)+?\s*<td>Total</td>

You have to enable the "s" option (dot-all mode, RegexOptions.Singleline in .NET) so that "." also matches the newlines between rows.
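
To illustrate why the dot-all option matters, here is a minimal sketch in Python (chosen only for illustration; the thread itself does not show application code). The sample HTML and the variable names are assumptions made up for this example. In Python the equivalent flag is `re.DOTALL`.

```python
import re

# Hypothetical sample, trimmed from the table in the question.
html = """<tr attributes >
    <td>Name1</td>
    <td>some text</td>
</tr>
<tr>
    <td>inner</td>
    <td>--------</td>
</tr>
<tr>
    <td>Total</td>
    <td>1989</td>
</tr>"""

# The pattern from this answer, unchanged.
PATTERN = (
    r"<td>Name(1|3)</td>"
    r"((\s*<td>([^<]+)</td>\s*)+</tr>(.*?)<tr>)+?"
    r"\s*<td>Total</td>"
)

# Without dot-all, "." cannot cross the newline between </tr> and <tr>.
without_dotall = re.search(PATTERN, html)

# With dot-all, "(.*?)" may span line breaks, so the pattern can match.
with_dotall = re.search(PATTERN, html, re.DOTALL)

print(without_dotall)
print(with_dotall.group(0) if with_dotall else None)
```

Note that a repeated capture group only keeps the text of its last repetition, so the inner groups are of limited use when several rows sit between the name row and the Total row; the overall match (group 0) is the reliable part here.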

Sidux

I agree with the others when they say you should use a parser. That solution would be more robust than a regex. But if you know the HTML you will run the regex against will not change much, the regex approach can work. Know that even a small change to the HTML can cause this solution to fail later on. For example, if you add attributes to any of the inner rows, this regex will not find a match. The regex can be made to work in that case as well, but then it gets more complicated and harder to read.
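
For comparison, the parser route can be sketched without third-party tools. The example below is in Python and uses only the standard library's `html.parser`; it is an illustration under assumptions, not the asker's environment. The class `RowCollector`, the functions `rows_between` and `make_block`, and the sample data are hypothetical names invented for this sketch.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect every <tr> as a list of its <td> texts."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def rows_between(html, wanted_names):
    """Map each wanted name to the rows between it and the next Total row."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    result, current = {}, None
    for row in parser.rows:
        first = row[0] if row else ""
        if first in wanted_names:
            current = first
            result[current] = []
        elif first == "Total":
            current = None
        elif current is not None:
            result[current].append(row)
    return result


def make_block(name, inner_text, total):
    # Hypothetical sample data shaped like the table in the question.
    return (
        f"<tr attributes >\n    <td>{name}</td>\n    <td>some text</td>\n</tr>\n"
        f"<tr>\n    <td>{inner_text}</td>\n    <td>--------</td>\n</tr>\n"
        f"<tr>\n    <td>Total</td>\n    <td>{total}</td>\n</tr>\n"
    )


sample = (make_block("Name1", "row A", 1989)
          + make_block("Name2", "row B", 1979)
          + make_block("Name3", "row C", 1089))
found = rows_between(sample, {"Name1", "Name3"})
print(found)
```

Because the grouping works on parsed rows rather than on raw markup, extra attributes or different whitespace on the inner rows should not affect it.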

This regex works against the sample HTML you provided in your question. Use capture group 1 to get only the inner rows:

<tr\s+[^>]+>\s*<td>Name(?:1|3)</td>(?:\s*<td>[\w\s-]+</td>)+\s*</tr>((?:\s*<tr>(?:\s*<td>[\w\s-]+</td>)+\s*</tr>)+?)\s*<tr>\s*<td>Total</td>(?:\s*<td>[\w\s-]+</td>)+\s*</tr>
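
A minimal sketch of applying this pattern and reading capture group 1, written in Python for illustration. The helper `make_block`, the sample rows, and the variable names are assumptions made for the example; only the pattern string comes from this answer.

```python
import re

# The pattern from this answer, split over three lines for readability.
PATTERN = re.compile(
    r"<tr\s+[^>]+>\s*<td>Name(?:1|3)</td>(?:\s*<td>[\w\s-]+</td>)+\s*</tr>"
    r"((?:\s*<tr>(?:\s*<td>[\w\s-]+</td>)+\s*</tr>)+?)"
    r"\s*<tr>\s*<td>Total</td>(?:\s*<td>[\w\s-]+</td>)+\s*</tr>"
)


def make_block(name, inner_texts, total):
    # Hypothetical sample data shaped like the table in the question.
    rows = [f"<tr attributes >\n    <td>{name}</td>\n    <td>some text</td>\n</tr>"]
    for text in inner_texts:
        rows.append(f"<tr>\n    <td>{text}</td>\n    <td>--------</td>\n</tr>")
    rows.append(
        f"<tr>\n    <td>Total</td>\n    <td>--------</td>\n    <td>{total}</td>\n</tr>"
    )
    return "\n".join(rows)


html = "\n".join([
    make_block("Name1", ["row A"], 1989),
    make_block("Name2", ["row B", "row C"], 1979),
    make_block("Name3", ["row D"], 1089),
])

# Group 1 holds only the rows between the name row and the Total row.
inner_blocks = [m.group(1) for m in PATTERN.finditer(html)]

# Pull the cell texts out of each captured block.
cells = [re.findall(r"<td>([^<]*)</td>", block) for block in inner_blocks]
print(cells)
```

The Name2 block is skipped because the first cell must be "Name1" or "Name3".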

Here is a rough breakdown of the regex:

#Match the first row.
<tr\s+[^>]+>                    #Match the opening TR tag, allow for any attributes found inside the tag.
\s*<td>Name(?:1|3)</td>         #Match the first cell. Only allow its contents to be "Name1" or "Name3".
(?:\s*<td>[\w\s-]+</td>)+       #Match all other cells in this row.
\s*</tr>                        #Match the end of the row.

#Match all rows between the first and last row (this is capture group 1).
(
    (?:
        \s*<tr>                         #Match the beginning of an inner row.
            (?:\s*<td>[\w\s-]+</td>)+   #Match all the cells in the current row.
        \s*</tr>                        #Match the end of the current row.
    )+?
)

#Match the last row.
\s*<tr>                         #Match the beginning of the last row.
\s*<td>Total</td>               #Match the first cell. Only allow its contents to be "Total".
(?:\s*<td>[\w\s-]+</td>)+       #Match all other cells in this row.
\s*</tr>                        #Match the end of the last row.
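
A commented layout like this can be compiled directly when the engine offers a free-spacing mode (`re.VERBOSE` in Python, `RegexOptions.IgnorePatternWhitespace` in .NET). The following is a sketch in Python for illustration, written with the capturing group around the inner rows; the sample HTML and the names `VERBOSE_PATTERN` and `inner_rows` are assumptions.

```python
import re

VERBOSE_PATTERN = re.compile(r"""
    # First row: opening TR with attributes, then the name cell.
    <tr\s+[^>]+>
    \s*<td>Name(?:1|3)</td>
    (?:\s*<td>[\w\s-]+</td>)+       # all other cells in this row
    \s*</tr>

    # Capture group 1: every row between the first and the last row.
    (
        (?:
            \s*<tr>
                (?:\s*<td>[\w\s-]+</td>)+
            \s*</tr>
        )+?
    )

    # Last row: first cell must be "Total".
    \s*<tr>
    \s*<td>Total</td>
    (?:\s*<td>[\w\s-]+</td>)+
    \s*</tr>
""", re.VERBOSE)

# Hypothetical sample shaped like the third block in the question.
html = """<tr attributes >
    <td>Name3</td>
    <td>some text</td>
</tr>
<tr>
    <td>inner row</td>
    <td>--------</td>
</tr>
<tr>
    <td>Total</td>
    <td>--------</td>
    <td>1089</td>
</tr>"""

match = VERBOSE_PATTERN.search(html)
inner_rows = match.group(1) if match else None
print(inner_rows)
```

In free-spacing mode, whitespace and `#` comments outside character classes are ignored, so the pattern behaves like the one-line version while staying readable.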
Francis Gagnon