
Not sure how to title this question exactly - I'm open to suggestions. Clearly, I'm doing something wrong with my regular expression.

I'm using .NET 4.6.2 Regex class with the options:

RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline

The input is as follows:

<!--malformed HTML beyond my control-->
<table summary="Profile Information" width="100%">
    <tr>
        <td height="5" colspan="2" scope="row"></td>
    </tr>
    <tr>
        <td colspan="2" scope="row"><font size="4"><b>Profile</b></font></td>
    </tr>
    <tr>
        <td valign="top" scope="row">Name: </td>
        <td align="right">Bob Smith</td>
    </tr>
    <tr>
        <td height="5" colspan="2" scope="row"></td>
    </tr>
    <tr>
        <td colspan="2" scope="row"><font size="4"><b>Personal Information</b></font></td>
    </tr>
    <tr>
        <td valign="top" scope="row">Position: </td>
        <td valign="bottom" align="right">IT Director</td>
    </tr>
    <tr>
        <td valign="top" scope="row">Address: </td>
        <td valign="bottom" align="right">1234 Main St
                    Austin, TX
        </td>
    </tr>
</table>
<!--malformed HTML beyond my control-->

My regular expression is as follows:

<tr>\s*<td.*?>(.*?)</td>\s*<td.*?>(.*?)</td>\s*</tr>

I'm expecting it to match the table rows that have two cells defined and skip the rows that have only a single cell. Furthermore, I'm expecting it to capture both the property name (i.e. Name:, Position:, Address:) and the value associated with it.

Instead, I'm getting the following captures:

  1. Matched String <tr> <td height="5" colspan="2" scope="row"></td> </tr> <tr> <td colspan="2" scope="row"><font size="4"><b>Profile</b></font></td> </tr> <tr> <td valign="top" scope="row">Name: </td> <td align="right">Bob Smith</td> </tr>

    $1 </td> </tr> <tr> <td colspan="2" scope="row"><font size="4"><b>Profile</b></font></td> </tr> <tr> <td valign="top" scope="row">Name:

    $2 Bob Smith

  2. Matched String <tr> <td height="5" colspan="2" scope="row"></td> </tr> <tr> <td colspan="2" scope="row"><font size="4"><b>Personal Information</b></font></td> </tr> <tr> <td valign="top" scope="row">Position: </td> <td valign="bottom" align="right">IT Director</td> </tr>

    $1 </td> </tr> <tr> <td colspan="2" scope="row"><font size="4"><b>Personal Information</b></font></td> </tr> <tr> <td valign="top" scope="row">Position:

    $2 IT Director

  3. Matched String <tr> <td valign="top" scope="row">Address: </td> <td valign="bottom" align="right">1234 Main St Austin, TX </td> </tr>

    $1 Address:

    $2 1234 Main St Austin, TX

I apologize for not being able to put the results into a more succinct format. Tables aren't allowed for questions apparently.

What I think might be going wrong

It seems to me that one of my dot matchers is matching more than I want it to match. I've told them to be non-greedy (.*?), so I'm a little confused why they seem to be matching beyond the first encountered ending tag.

As far as I can tell, this should never be in any match:

<tr>
<td height="5" colspan="2" scope="row"></td>
</tr>

Yet, it appears in the first matched string.

What am I missing? How should this be achieved?

Let me know if there is any additional information required for this question.

P.S. I've been using http://regexstorm.net/tester to try to debug the issue.

crush
  • Please explain why you are using an improper tool to parse HTML. I understand you have access to code, so why not use something like HtmlAgilityPack? – Wiktor Stribiżew Jun 27 '17 at 20:30
  • @WiktorStribiżew Because the HTML is invalid, and can't be read by a DOM reader. I have no control over the input text. This portion is fine, but other bits of the entire HTML file throw HtmlAgilityPack for a loop. – crush Jun 27 '17 at 20:30
  • @WiktorStribiżew I suppose I could try to extract just this table and feed it to HtmlAgilityPack. – crush Jun 27 '17 at 20:35
  • I will close this one as the solution is clear: tempered greedy token. `.*?` does not guarantee the shortest match between strings. Either use `[^<>]*` when inside an element node, or use `(?:(?!` and next `` in order not to overmatch. Adapt as per your needs. – Wiktor Stribiżew Jun 27 '17 at 20:36
  • @WiktorStribiżew I've never heard of tempered greedy token, but it does seem like it could solve the issue. Thanks for bringing to my attention. – crush Jun 27 '17 at 20:51

2 Answers


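Making a quantifier non-greedy doesn’t change the rule that the leftmost match wins. The engine starts at the first <tr>, and a lazy .*? still expands as far as it has to for the rest of the pattern to succeed, even across </td> and </tr>. If there’s a greedy match at a given position, there will also be a non-greedy match at that position. You can hack it by not letting the captured groups match any </td>s: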

<tr>\s*<td.*?>((?:(?!</td>).)*?)</td>\s*<td.*?>((?:(?!</td>).)*?)</td>\s*</tr>
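As the comments under this answer point out, the <td.*?> parts of that pattern can still run past the end of an opening tag. A minimal sketch of one possible tightening, assuming attribute values never contain a > character: [^>]* replaces .*? inside each opening tag, and the lookahead still keeps each captured cell from crossing a </td>. The class name RowExtractor, the method Extract, and the shortened sample input are made up for illustration; this is not the answerer's own code.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RowExtractor
{
    // Assumption: [^>]* keeps the attribute part inside a single opening tag,
    // and (?:(?!</td>).)* stops each captured cell body before the next </td>.
    private static readonly Regex TwoCellRow = new Regex(
        @"<tr>\s*<td[^>]*>((?:(?!</td>).)*)</td>\s*<td[^>]*>((?:(?!</td>).)*)</td>\s*</tr>",
        RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns one (name, value) pair for each row that consists of exactly two cells.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var pairs = new List<KeyValuePair<string, string>>();
        foreach (Match m in TwoCellRow.Matches(html))
        {
            pairs.Add(new KeyValuePair<string, string>(
                m.Groups[1].Value.Trim(), m.Groups[2].Value.Trim()));
        }
        return pairs;
    }

    public static void Main()
    {
        // Hypothetical, shortened version of the table from the question.
        const string html = @"<table>
  <tr><td colspan=""2"" scope=""row""></td></tr>
  <tr><td scope=""row"">Name: </td><td align=""right"">Bob Smith</td></tr>
  <tr><td colspan=""2"" scope=""row""><b>Personal Information</b></td></tr>
  <tr><td scope=""row"">Position: </td><td align=""right"">IT Director</td></tr>
</table>";

        foreach (var pair in Extract(html))
        {
            Console.WriteLine(pair.Key + " => " + pair.Value);
        }
    }
}
```

On this sample the single-cell rows fail at the second <td and are skipped, so only the Name and Position rows should come back. The pattern still expects a bare <tr> with no attributes, as the original does.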

But I’d actually do it in two steps, by first matching:

<tr>(.*?)</tr>

and then inside each of those, checking the rest of the simpler expression.

Ry-
  • I'm not getting the desired output from your first expression either, but I think it's the right idea, and just needs some slight tweaking. It seems to use the tempered greedy token approach mentioned in the comments. – crush Jun 27 '17 at 20:52
  • I came up with this, based on your answer: `<tr>\s*<td(?:(?!>).)*>((?:(?!</td>).)+)</td>\s*<td(?:(?!>).)*>((?:(?!</td>).)+)</td>\s*</tr>` I'm going to try and apply some of the further instruction from @WiktorStribiżew from [his answer here](https://stackoverflow.com/a/37343088/1195273) – crush Jun 27 '17 at 21:08
  • @crush: Oh, yep, I missed those `.*`s. Sorry. (And it is the same approach, but note that “tempered greedy token” isn’t a standard term or something I’d ever call it….) – Ry- Jun 27 '17 at 21:53

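Use .*? instead of .* to make the quantifiers lazy. That alone isn't enough here, because a lazy quantifier can still expand across rows, so match each row first and then look for the two cells inside it.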

Try this:

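using System.Text.RegularExpressions;

// readText is assumed to hold the HTML input shown in the question.
const RegexOptions options = RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline;

// Step 1: isolate each row, so the cell pattern can never run past its own </tr>.
Regex rowRegex = new Regex(@"<tr>(.+?)</tr>", options);
// Step 2: within a single row, require two consecutive cells.
Regex cellRegex = new Regex(@"<td.*?>(.*?)</td>\s*<td.*?>(.*?)</td>", options);

foreach (Match rowMatch in rowRegex.Matches(readText))
{
    Match cellMatch = cellRegex.Match(rowMatch.Value);

    // Rows with only one cell do not match and are skipped.
    if (cellMatch.Success)
    {
        string inner1 = cellMatch.Groups[1].Value; // property name, e.g. "Name: "
        string inner2 = cellMatch.Groups[2].Value; // property value, e.g. "Bob Smith"
    }
}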
Rene