-1

I have a string:

Test.
<div>
<table style="color:blue;"><tbody><!--START SPACE COMMENTS SUMMARY-->
<tr><td colspan="2">SPACE COMMENTS SUMMARY</td></tr>
<tr><td style="min-width:200px;">Area/Room</td>
<td style="max-width:300px;text-align:left;">Comments</td>
</tr><tr><td style="min-width:200px;">Bathroom</td>
<td style="max-width:300px;text-align:left;">Some comment</td></tr>
<!--END SPACE COMMENTS SUMMARY--></tbody></table>
<div>
<table style="color:blue;"><tbody><!--START SPACE SUMMARY-->
<tr><td colspan="2">SPACE SUMMARY</td></tr><tr>
<td style="min-width:200px;">Space</td>
<td style="max-width:300px;text-align:right;">Installed Price</td></tr>
<tr><td style="min-width:200px;">Bathroom</td>
<td style="max-width:300px;text-align:right;">$2,355.97</td></tr>
<!--END SPACE SUMMARY--></tbody></table>
<br><br><br><div>Some text.</div></div></div>

I want to use a regex to select the table that contains the comments <!--START SPACE SUMMARY--> and <!--END SPACE SUMMARY-->.

I tried @"<table.*?><tbody.*?><!--START SPACE SUMMARY-->.*?<!--END SPACE SUMMARY--></tbody></table>", but it selects both tables in the string.

EDIT: My question is not specifically about HTML. The same question would apply if I had the string:

some text blah blah one some text blah blah two.

and wanted to select `some text blah blah two` with the pattern `some text.*?two`.

kiriz

3 Answers

1
string test = @"Test.
    <div>
    <table style=""color:blue;""><tbody><!--START SPACE COMMENTS SUMMARY-->
    <tr><td colspan=""2"">SPACE COMMENTS SUMMARY</td></tr>
    <tr><td style=""min-width:200px;"">Area/Room</td>
    <td style=""max-width:300px;text-align:left;"">Comments</td>
    </tr><tr><td style=""min-width:200px;"">Bathroom</td>
    <td style=""max-width:300px;text-align:left;"">Some comment</td></tr>
    <!--END SPACE COMMENTS SUMMARY--></tbody></table>
    <div>
    <table style=""color:blue;""><tbody><!--START SPACE SUMMARY-->
    <tr><td colspan=""2"">SPACE SUMMARY</td></tr><tr>
    <td style=""min-width:200px;"">Space</td>
    <td style=""max-width:300px;text-align:right;"">Installed Price</td></tr>
    <tr><td style=""min-width:200px;"">Bathroom</td>
    <td style=""max-width:300px;text-align:right;"">$2,355.97</td></tr>
    <!--END SPACE SUMMARY--></tbody></table>
    <br><br><br><div>Some text.</div></div></div>";

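// Requires: using System.Text.RegularExpressions;
// RegexOptions.Singleline lets '.' match newlines, so the pattern can span several lines.
MatchCollection matches = Regex.Matches(
    test,
    @"<table(?!.*<table).*?<!--START SPACE SUMMARY-->.*?<!--END SPACE SUMMARY-->.*?table>",
    RegexOptions.Singleline);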

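The idea is to use the negative lookahead (?!.*<table) to tell the regex engine that no other <table may follow the one being matched. Because . also matches newlines under RegexOptions.Singleline, the lookahead scans to the end of the input, so this relies on the wanted table being the last one in the string.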
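If the wanted table is not guaranteed to be the last one in the input, one possible variation (a sketch, not the answerer's code) is to repeat the lookahead before every character, so the match can never run across the start of another table. The names `TableFinder` and `FindTable` and the shortened sample HTML are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class TableFinder
{
    // Hypothetical helper: returns the <table>...</table> block that contains the
    // START/END comment pair for the given marker, or an empty string if none is found.
    public static string FindTable(string html, string marker)
    {
        string name = Regex.Escape(marker);
        // (?:(?!<table\b).)*? consumes one character at a time, but only while
        // no new "<table" begins at the current position.
        string pattern =
            @"<table\b(?:(?!<table\b).)*?<!--START " + name + "-->" +
            @".*?<!--END " + name + @"-->.*?</table>";
        Match m = Regex.Match(html, pattern, RegexOptions.Singleline);
        return m.Success ? m.Value : string.Empty;
    }

    static void Main()
    {
        // Assumed sample input: the wanted table sits between two other tables.
        string html =
            "<table><tbody><!--START SPACE COMMENTS SUMMARY-->a<!--END SPACE COMMENTS SUMMARY--></tbody></table>\n" +
            "<table><tbody><!--START SPACE SUMMARY-->b<!--END SPACE SUMMARY--></tbody></table>\n" +
            "<table><tbody><!--START OTHER-->c<!--END OTHER--></tbody></table>";

        Console.WriteLine(FindTable(html, "SPACE SUMMARY"));
        // prints: <table><tbody><!--START SPACE SUMMARY-->b<!--END SPACE SUMMARY--></tbody></table>
    }
}
```

This is still a regex over HTML, so it assumes tables are not nested and that the comment markers appear exactly as written.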

Ghasan غسان
1

Let's focus on the non-HTML problem you have: matching the closest window between two delimiters. Use a tempered greedy token:

(?s)some text(?:(?!some text|two).)*two
    |<-1st->||<----TG Token ------>||
                                    |2nd delimiter

See the regex demo

For HTML parsing, use HtmlAgilityPack; it will make life easier for everyone who has to maintain your code.

The (?s) turns on DOTALL mode, in which . matches any character including a newline. The tempered greedy token (?:(?!some text|two).)* matches any single character, zero or more times, as long as that character does not start the literal sequence some text or two.
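A minimal C# sketch of the difference, using the sample string from the question. `ClosestWindowDemo` and `FirstMatch` are hypothetical names, and `RegexOptions.Singleline` is used here in place of the inline (?s).

```csharp
using System;
using System.Text.RegularExpressions;

static class ClosestWindowDemo
{
    // Hypothetical helper: returns the first match of pattern in input, or "" if none.
    public static string FirstMatch(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern, RegexOptions.Singleline);
        return m.Success ? m.Value : string.Empty;
    }

    static void Main()
    {
        string input = "some text blah blah one some text blah blah two.";

        // Lazy quantifier: the match still starts at the FIRST "some text",
        // because the engine reports the leftmost possible match.
        Console.WriteLine(FirstMatch(input, @"some text.*?two"));
        // prints: some text blah blah one some text blah blah two

        // Tempered greedy token: the match may not run across another "some text",
        // so only the closest window is returned.
        Console.WriteLine(FirstMatch(input, @"some text(?:(?!some text|two).)*two"));
        // prints: some text blah blah two
    }
}
```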

Wiktor Stribiżew
  • All right. But to put in context of what I really need - how to get `some maybe more text blah blah two` from string `some more text blah blah one some maybe more text blah blah two`? Using words `some`, `text` and `two`. – kiriz Apr 25 '16 at 10:43
  • Same way, just the right side can be anything, as we are not interested in it: https://regex101.com/r/pE1qG5/1 – Wiktor Stribiżew Apr 25 '16 at 10:47
  • Have you checked the suggested solution? It is selecting all the text. – kiriz Apr 25 '16 at 13:04
  • Ok, I see, I updated it before closing the window, sorry. The safest way is to use the token on the right side, too: [`some(?:(?!some|two|text).)*text(?:(?!some|two|text).)*two`](https://regex101.com/r/pE1qG5/2). – Wiktor Stribiżew Apr 25 '16 at 13:06
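The pattern from the last comment can be generalized to any ordered list of keywords. The following is only a sketch under that assumption; `KeywordWindow`, `BuildPattern` and `ClosestMatch` are hypothetical names, not part of any library.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class KeywordWindow
{
    // Builds keyword1 TOKEN keyword2 TOKEN ... keywordN, where TOKEN is a tempered
    // greedy token that refuses to step over the start of any of the keywords.
    public static string BuildPattern(params string[] keywords)
    {
        string[] escaped = keywords.Select(k => Regex.Escape(k)).ToArray();
        string token = "(?:(?!" + string.Join("|", escaped) + ").)*";
        return string.Join(token, escaped);
    }

    // Returns the first (and therefore closest) window, or "" if there is none.
    public static string ClosestMatch(string input, params string[] keywords)
    {
        Match m = Regex.Match(input, BuildPattern(keywords), RegexOptions.Singleline);
        return m.Success ? m.Value : string.Empty;
    }

    static void Main()
    {
        string input = "some more text blah blah one some maybe more text blah blah two";

        Console.WriteLine(BuildPattern("some", "text", "two"));
        // prints: some(?:(?!some|text|two).)*text(?:(?!some|text|two).)*two

        Console.WriteLine(ClosestMatch(input, "some", "text", "two"));
        // prints: some maybe more text blah blah two
    }
}
```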
0

Try this:

<table.*?><tbody.*?><!--START (SPACE SUMMARY)-->.*?<!--END \1--><\/tbody><\/table>

This keeps the non-greedy quantifiers, but uses the backreference \1 to repeat the text captured by group 1 and escapes each / as \/. The unescaped slashes may be the source of the problem.

Aminah Nuraini