
I have a .NET app that uses regular expressions to extract information from some HTML. The HTML is not XML compliant, so I can't parse it using XDoc. Here is a small piece of the HTML that I'm having problems with:

<td class="program">
    <div>
        <h2>
            The O'Reilly Factor
        </h2>
    </div>
</td>
<td class="program">
    <div>
        <span class="font-icon-new">New</span>
        <h2>
            The Kelly File
        </h2>
    </div>
</td>

The regular expression I'm using is:

(<td class="program">.*?(?<isnew>font-icon-new)?.*</td>)+

What I'm expecting in this scenario is two captured groups. The first group's "isnew" group would be blank (a non-hit), but the second group's "isnew" group would be populated. However, the "isnew" group is always blank, and I've tried multiple variations and simplified it down as much as possible to no avail. I'm also using the RegexOptions.Singleline option to ensure the "." also matches newline characters. Any ideas on what I'm missing?

Thanks in advance.

karthik manchala
cas4
  • [You might want to reconsider parsing HTML with regular expressions.](http://stackoverflow.com/a/1732454/2825369) Maybe look at the [HTML Agility Pack](http://htmlagilitypack.codeplex.com/)? – Ben N May 07 '15 at 00:32
  • Are you looking to extract the actual XElements or just the actual information between the H2 tags? – ΩmegaMan May 07 '15 at 00:40
  • I just want to know of the existence of the "font-icon-new" class. This logic has been simplified down as much as possible to troubleshoot this, but I am successfully extracting other pieces of information from within the tags. – cas4 May 07 '15 at 01:28

1 Answer


I think you are misusing (if not abusing) the regex engine. Since you already have to check if a known sequence of characters can be inside the string, can't you use a simple String.Contains()?
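If each cell has already been isolated, the check suggested above could be as small as the sketch below. The `cell` string is a hypothetical, flattened copy of the second cell from the question; how the cells get isolated is not shown and is an assumption. The snippet uses C# 9 top-level statements.

```csharp
using System;

// Hypothetical input: one already-isolated cell, flattened onto a single line.
string cell = "<td class=\"program\"><div><span class=\"font-icon-new\">New</span>"
            + "<h2>The Kelly File</h2></div></td>";

// String.Contains(string) performs an ordinal, case-sensitive search.
bool isNew = cell.Contains("font-icon-new");

Console.WriteLine(isNew); // True
```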

Now, why does this regex not capture the attribute value? The ? and .* quantifiers are greedy, while .*? is lazy. Let's add capturing groups around those subpatterns to see what exactly we are capturing:

(<td class="program">(.*?)(?<isnew>font-icon-new)?(.*)</td>)+

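Group 2 ((.*?)) is empty! Everything after <td class="program"> is captured into Group 3 ((.*)). The lazy .*? first tries to match nothing, the optional isnew group then fails to find font-icon-new at that position and is skipped, and the greedy .* consumes the rest. Have a look at this excerpt: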

In situations where the decision is between “make an attempt” and “skip an attempt,” as with items governed by quantifiers, the engine always chooses to first make the attempt for greedy quantifiers, and to first skip the attempt for lazy (non-greedy) ones. - Mastering Regular Expressions, p.159
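The behaviour described above can be checked with a short sketch like the one below. It assumes C# 9 top-level statements and rebuilds the sample from the question with "\n" line breaks; the variable names are illustrative only. Note that .NET numbers unnamed groups before named ones, so (.*?) is group 2 and (.*) is group 3.

```csharp
using System;
using System.Text.RegularExpressions;

// Sample input from the question, joined with "\n" (an assumption about line endings).
string html = string.Join("\n",
    "<td class=\"program\">",
    "    <div>",
    "        <h2>",
    "            The O'Reilly Factor",
    "        </h2>",
    "    </div>",
    "</td>",
    "<td class=\"program\">",
    "    <div>",
    "        <span class=\"font-icon-new\">New</span>",
    "        <h2>",
    "            The Kelly File",
    "        </h2>",
    "    </div>",
    "</td>");

// The instrumented pattern: extra groups around the lazy and greedy parts.
string pattern = "(<td class=\"program\">(.*?)(?<isnew>font-icon-new)?(.*)</td>)+";
Match m = Regex.Match(html, pattern, RegexOptions.Singleline);

// The lazy (.*?) settles for zero characters.
Console.WriteLine(m.Groups[2].Length);                          // 0
// The optional group is tried right after the opening tag and skipped.
Console.WriteLine(m.Groups["isnew"].Success);                   // False
// The greedy (.*) swallows the rest, including the text we wanted to capture.
Console.WriteLine(m.Groups[3].Value.Contains("font-icon-new")); // True
```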

The best regex fix I can imagine is combining the optional word and the next .*? pattern into an optional (greedy) non-capturing group like (?:(?<isnew>font-icon-new).*?)?:

(<td class="program">.*?(?:(?<isnew>font-icon-new).*?)?</td>)+
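As a rough usage sketch, assuming C# 9 top-level statements and the sample from the question joined with "\n", the corrected pattern can be run with Regex.Matches and the isnew group checked per match. The names `html`, `pattern`, and `flags` are illustrative only.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Sample input from the question, joined with "\n" (an assumption about line endings).
string html = string.Join("\n",
    "<td class=\"program\">",
    "    <div>",
    "        <h2>",
    "            The O'Reilly Factor",
    "        </h2>",
    "    </div>",
    "</td>",
    "<td class=\"program\">",
    "    <div>",
    "        <span class=\"font-icon-new\">New</span>",
    "        <h2>",
    "            The Kelly File",
    "        </h2>",
    "    </div>",
    "</td>");

// Corrected pattern: the optional group now owns the lazy .*? that follows it.
string pattern = "(<td class=\"program\">.*?(?:(?<isnew>font-icon-new).*?)?</td>)+";

var flags = new List<bool>();
foreach (Match m in Regex.Matches(html, pattern, RegexOptions.Singleline))
{
    // Success is true only when font-icon-new was found inside this cell.
    flags.Add(m.Groups["isnew"].Success);
}

Console.WriteLine(string.Join(", ", flags)); // False, True
```

With this input the line break between the two cells stops the outer + after one iteration, so each cell comes back as its own match. If the cells were directly adjacent, both would presumably land in a single match, and the Captures collections of the groups would have to be inspected to tell them apart.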

[Image: results in Expresso, with Singleline mode ON]

Wiktor Stribiżew
  • As I stated in a previous comment, the example I showed is a very simplified version. I'm also capturing other information such as title, description, times, etc. And since there are two sets that I'm searching through, a simple string.Contains will not work. Adding the non-capturing group fixed my problem. Thanks for your explanation on what was going on. – cas4 May 07 '15 at 12:52
  • Ok, I think it is oversimplified then :) Please don't get me wrong, just SO "gurus" usually frown at such samples, and those who answer often get downvoted because of that. Good luck! – Wiktor Stribiżew May 07 '15 at 13:26