Not sure how to title this question exactly - I'm open to suggestions. Clearly, I'm doing something wrong with my regular expression.
I'm using the .NET 4.6.2 Regex class with the following options:
RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline
The input is as follows:
<!--malformed HTML beyond my control-->
<table summary="Profile Information" width="100%">
<tr>
<td height="5" colspan="2" scope="row"></td>
</tr>
<tr>
<td colspan="2" scope="row"><font size="4"><b>Profile</b></font></td>
</tr>
<tr>
<td valign="top" scope="row">Name: </td>
<td align="right">Bob Smith</td>
</tr>
<tr>
<td height="5" colspan="2" scope="row"></td>
</tr>
<tr>
<td colspan="2" scope="row"><font size="4"><b>Personal Information</b></font></td>
</tr>
<tr>
<td valign="top" scope="row">Position: </td>
<td valign="bottom" align="right">IT Director</td>
</tr>
<tr>
<td valign="top" scope="row">Address: </td>
<td valign="bottom" align="right">1234 Main St
Austin, TX
</td>
</tr>
</table>
<!--malformed HTML beyond my control-->
My regular expression is as follows:
<tr>\s*<td.*?>(.*?)</td>\s*<td.*?>(.*?)</td>\s*</tr>
I'm expecting it to match only the table rows that have two cells defined and to skip the rows that have only a single cell. Furthermore, I'm expecting it to capture both the property names (i.e. Name:, Position:, Address:) and the values associated with them.
Instead, I'm getting the following captures:
- Matched String
<tr> <td height="5" colspan="2" scope="row"></td> </tr> <tr> <td colspan="2" scope="row"><font size="4"><b>Profile</b></font></td> </tr> <tr> <td valign="top" scope="row">Name: </td> <td align="right">Bob Smith</td> </tr>
$1
</td> </tr> <tr> <td colspan="2" scope="row"><font size="4"><b>Profile</b></font></td> </tr> <tr> <td valign="top" scope="row">Name:
$2
Bob Smith
- Matched String
<tr> <td height="5" colspan="2" scope="row"></td> </tr> <tr> <td colspan="2" scope="row"><font size="4"><b>Personal Information</b></font></td> </tr> <tr> <td valign="top" scope="row">Position: </td> <td valign="bottom" align="right">IT Director</td> </tr>
$1
</td> </tr> <tr> <td colspan="2" scope="row"><font size="4"><b>Personal Information</b></font></td> </tr> <tr> <td valign="top" scope="row">Position:
$2
IT Director
- Matched String
<tr> <td valign="top" scope="row">Address: </td> <td valign="bottom" align="right">1234 Main St Austin, TX </td> </tr>
$1
Address:
$2
1234 Main St Austin, TX
I apologize for not being able to put the results into a more succinct format. Tables aren't allowed for questions apparently.
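For anyone who wants to reproduce the captures listed above without the online tester, here is a minimal sketch. It is written in Python rather than C#, on the assumption that Python's `re` module treats this particular pattern the same way the .NET engine does; `re.DOTALL` stands in for RegexOptions.Singleline and `re.IGNORECASE` for RegexOptions.IgnoreCase (RegexOptions.Compiled has no counterpart here and should not affect what is matched). The names `HTML`, `PATTERN` and `matches` are mine.

```python
import re

# The table from the question, without the surrounding malformed HTML.
HTML = """\
<table summary="Profile Information" width="100%">
<tr>
<td height="5" colspan="2" scope="row"></td>
</tr>
<tr>
<td colspan="2" scope="row"><font size="4"><b>Profile</b></font></td>
</tr>
<tr>
<td valign="top" scope="row">Name: </td>
<td align="right">Bob Smith</td>
</tr>
<tr>
<td height="5" colspan="2" scope="row"></td>
</tr>
<tr>
<td colspan="2" scope="row"><font size="4"><b>Personal Information</b></font></td>
</tr>
<tr>
<td valign="top" scope="row">Position: </td>
<td valign="bottom" align="right">IT Director</td>
</tr>
<tr>
<td valign="top" scope="row">Address: </td>
<td valign="bottom" align="right">1234 Main St
Austin, TX
</td>
</tr>
</table>
"""

# Same pattern as in the question; the flags are the assumed Python
# equivalents of IgnoreCase and Singleline.
PATTERN = re.compile(
    r"<tr>\s*<td.*?>(.*?)</td>\s*<td.*?>(.*?)</td>\s*</tr>",
    re.IGNORECASE | re.DOTALL,
)

matches = list(PATTERN.finditer(HTML))

for m in matches:
    # Collapse runs of whitespace so each value prints on a single line.
    print("- Matched String")
    print(" ".join(m.group(0).split()))
    print("$1")
    print(" ".join(m.group(1).split()))
    print("$2")
    print(" ".join(m.group(2).split()))
```

If the assumption about the two engines holds, this prints three matches, and the first two have a `$1` that begins with `</td>` and runs across two extra rows, just like the captures above.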
What I think might be going wrong
It seems to me that one of my dot matchers is matching more than I want it to. I've told them to be non-greedy (.*?), so I'm a little confused about why they seem to match beyond the first closing tag they encounter.
As far as I can tell, this should never be in any match:
<tr>
<td height="5" colspan="2" scope="row"></td>
</tr>
Yet, it appears in the first matched string.
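To narrow it down, the same effect shows up with a much smaller input. The sketch below is a reduced case, again in Python under the assumption that its `re` module backtracks like the .NET engine for this pattern; `SAMPLE` and `LAZY` are made-up names, and the input is a hypothetical one-cell row followed by a two-cell row.

```python
import re

# Hypothetical reduced input: a one-cell row, then a two-cell row.
SAMPLE = "<tr><td>x</td></tr><tr><td>k</td><td>v</td></tr>"

# Same shape as the pattern in the question, without attributes or whitespace.
LAZY = re.compile(r"<tr><td>(.*?)</td><td>(.*?)</td></tr>")

m = LAZY.search(SAMPLE)

print(m.group(0))  # the whole input, starting at the first <tr>
print(m.group(1))  # x</td></tr><tr><td>k
print(m.group(2))  # v
```

So the match still starts at the first `<tr>`, and the first lazy group grows past a `</td>` and a `</tr>` because that is the only way the rest of the pattern can succeed from that starting position. Non-greedy seems to mean "as short as possible from where the match began", not "stop at the first closing tag".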
What am I missing? How should this be achieved?
Let me know if there is any additional information required for this question.
P.S. I've been using http://regexstorm.net/tester to try to debug the issue.