Why is this regular expression in .NET capturing more than I intend?

Question

Not sure how to title this question exactly - I'm open to suggestions. Clearly, I'm doing something wrong with my regular expression.

I'm using the .NET 4.6.2 Regex class with the options:

RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline

The input is as follows:

<!--malformed HTML beyond my control-->
<table summary="Profile Information" width="100%">
    <tr>
        <td height="5" colspan="2" scope="row"></td>
    </tr>
    <tr>
        <td colspan="2" scope="row"><font size="4"><b>Profile</b></font></td>
    </tr>
    <tr>
        <td valign="top" scope="row">Name: </td>
        <td align="right">Bob Smith</td>
    </tr>
    <tr>
        <td height="5" colspan="2" scope="row"></td>
    </tr>
    <tr>
        <td colspan="2" scope="row"><font size="4"><b>Personal Information</b></font></td>
    </tr>
    <tr>
        <td valign="top" scope="row">Position: </td>
        <td valign="bottom" align="right">IT Director</td>
    </tr>
    <tr>
        <td valign="top" scope="row">Address: </td>
        <td valign="bottom" align="right">1234 Main St
                    Austin, TX
        </td>
    </tr>
</table>
<!--malformed HTML beyond my control-->

My regular expression is as follows:

<tr>\s*<td.*?>(.*?)</td>\s*<td.*?>(.*?)</td>\s*</tr>

I'm expecting it to match only the table rows that have two cells defined, and to skip the rows that have a single cell. Furthermore, I'm expecting it to capture both the property name (i.e. Name:, Position:, Address:) and the value associated with it.

Instead, I'm getting the following captures:

Matched String <tr> <td height="5" colspan="2" scope="row"></td> </tr> <tr> <td colspan="2" scope="row">Profile</td> </tr> <tr> <td valign="top" scope="row">Name: </td> <td align="right">Bob Smith</td> </tr>

$1 </td> </tr> <tr> <td colspan="2" scope="row">Profile</td> </tr> <tr> <td valign="top" scope="row">Name:

$2 Bob Smith

Matched String <tr> <td height="5" colspan="2" scope="row"></td> </tr> <tr> <td colspan="2" scope="row">Personal Information</td> </tr> <tr> <td valign="top" scope="row">Position: </td> <td valign="bottom" align="right">IT Director</td> </tr>

$1 </td> </tr> <tr> <td colspan="2" scope="row">Personal Information</td> </tr> <tr> <td valign="top" scope="row">Position:

$2 IT Director

Matched String <tr> <td valign="top" scope="row">Address: </td> <td valign="bottom" align="right">1234 Main St Austin, TX </td> </tr>

$1 Address:

$2 1234 Main St Austin, TX

I apologize for not being able to put the results into a more succinct format. Tables aren't allowed for questions apparently.

What I think might be going wrong

It seems to me that one of my dot matchers is matching more than I want it to match. I've told them to be non-greedy (.*?), so I'm a little confused why they seem to be matching beyond the first encountered ending tag.

As far as I can tell, this should never be in any match:

<tr>
<td height="5" colspan="2" scope="row"></td>
</tr>

Yet, it appears in the first matched string.

What am I missing? How should this be achieved?

Let me know if there is any additional information required for this question.

P.S. I've been using http://regexstorm.net/tester to try to debug the issue.

Please explain why you are using an improper tool to parse HTML. I understand you have access to code, so why not use something like HtmlAgilityPack? — Wiktor Stribiżew, Jun 27 '17 at 20:30
@WiktorStribiżew Because the HTML is invalid, and can't be read by a DOM reader. I have no control over the input text. This portion is fine, but other bits of the entire HTML file throw HtmlAgilityPack for a loop. — crush, Jun 27 '17 at 20:30
@WiktorStribiżew I suppose I could try to extract just this table and feed it to HtmlAgilityPack. — crush, Jun 27 '17 at 20:35
I will close this one as the solution is clear: tempered greedy token. `.*?` does not guarantee the shortest match between strings. Either use `[^<>]*` when inside an element node, or use `(?:(?!</td>).)*` followed by `</td>` in order not to overmatch. Adapt as per your needs. — Wiktor Stribiżew, Jun 27 '17 at 20:36
@WiktorStribiżew I've never heard of the tempered greedy token, but it does seem like it could solve the issue. Thanks for bringing it to my attention. — crush, Jun 27 '17 at 20:51

score 1 · Accepted Answer · answered Jun 27 '17 at 20:38

Non-greedy matches won’t affect the behaviour of taking the first match. If there’s a greedy match at a given position, there will also be a non-greedy match at that position. You can hack it by not matching any </td>s:

<tr>\s*<td.*?>((?:(?!</td>).)*?)</td>\s*<td.*?>((?:(?!</td>).)*?)</td>\s*</tr>

But I’d actually do it in two steps, by first matching:

<tr>(.*?)</tr>

and then inside each of those, checking the rest of the simpler expression.
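As a rough illustration of the single-expression route (a sketch, not code from the original answer), the version below also bounds the opening tags with `[^<>]*`, as suggested in the comments on the question, so that `<td ...>` cannot run past its own `>`. The class and method names (`TableRows`, `Parse`, `LazyDemo`) are hypothetical, and trimming the captured text is an added assumption about the desired output.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TableRows
{
    private const RegexOptions Options =
        RegexOptions.IgnoreCase | RegexOptions.Singleline;

    // Attributes use [^<>]* so an opening tag can never extend past its own '>'.
    // Cell text uses a tempered token, (?:(?!</td>).)*, which consumes one
    // character at a time and stops before the next "</td>".
    private static readonly Regex TwoCellRow = new Regex(
        @"<tr>\s*<td[^<>]*>((?:(?!</td>).)*)</td>\s*" +
        @"<td[^<>]*>((?:(?!</td>).)*)</td>\s*</tr>",
        Options);

    // Laziness only shortens the right-hand end of a match. The engine still
    // keeps the leftmost start that can succeed, so this returns "<a><b>x"
    // rather than "<b>x".
    public static string LazyDemo()
    {
        return Regex.Match("<a><b>x", "<.*?>x").Value;
    }

    // Returns one (name, value) pair per row that has exactly two cells
    // in the layout used by the question's sample table.
    public static List<KeyValuePair<string, string>> Parse(string html)
    {
        var rows = new List<KeyValuePair<string, string>>();
        foreach (Match m in TwoCellRow.Matches(html))
        {
            rows.Add(new KeyValuePair<string, string>(
                m.Groups[1].Value.Trim(),
                m.Groups[2].Value.Trim()));
        }
        return rows;
    }
}
```

Run against the sample table from the question, `Parse` should return three pairs (Name:, Position:, Address:), and each match should begin at the `<tr>` of its own row, because neither the attribute part nor the cell text can cross into a neighbouring row.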

Ry-


I'm not getting the desired output from your first expression either, but I think it's the right idea, and just needs some slight tweaking. It seems to use the tempered greedy token approach mentioned in the comments. – crush Jun 27 '17 at 20:52
I came up with this, based on your answer: `<tr>\s*<td(?:(?!</td>).)*>((?:(?!</td>).)+)</td>\s*<td(?:(?!</td>).)*>((?:(?!</td>).)+)</td>\s*</tr>` I'm going to try to apply some of the further instructions from @WiktorStribiżew from [his answer here](https://stackoverflow.com/a/37343088/1195273) – crush Jun 27 '17 at 21:08
@crush: Oh, yep, I missed those `.*`s. Sorry. (And it is the same approach, but note that “tempered greedy token” isn’t a standard term or something I’d ever call it….) – Ry- Jun 27 '17 at 21:53

Rene · Answer 2 · 2017-06-27T21:54:28.543

1

Try .*? instead of .* so that the quantifier is lazy rather than greedy.

Try this:

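// Assumes: using System; using System.Text.RegularExpressions;
// and that readText already holds the HTML input.
RegexOptions options = RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline;

// Step 1: isolate each row, so a later match cannot span several rows.
Regex rowRegex = new Regex(@"<tr>(.+?)</tr>", options);
// Step 2: inside a single row, require two cells. Built once, outside the loop.
Regex cellRegex = new Regex(@"<td.*?>(.*?)</td>\s*<td.*?>(.*?)</td>", options);

foreach (Match rowMatch in rowRegex.Matches(readText))
{
    string outer = rowMatch.Groups[0].Value;
    Match cells = cellRegex.Match(outer);

    // Rows with a single cell fail here and are skipped.
    if (cells.Success)
    {
        string inner1 = cells.Groups[1].Value; // property name, e.g. "Name: "
        string inner2 = cells.Groups[2].Value; // property value, e.g. "Bob Smith"
        Console.WriteLine(inner1 + inner2);
    }
}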

edited Jun 27 '17 at 21:54

answered Jun 27 '17 at 20:40

Rene


Isn't that what I was already doing? – crush Jun 27 '17 at 21:08
Sorry, I didn't see that you already used ? in your regex. Give me a second, I'll try to reproduce your issue – Rene Jun 27 '17 at 21:15
I edited my post – Rene Jun 27 '17 at 21:54
Good answer showing how to do it in multiple steps, but I elected to use the single regexp approach shown above. +1 – crush Jun 28 '17 at 14:13
