Need a hand with regex in C# that spans multiple rows

Question

I am trying to parse an HTML page that contains table rows. I need to get all of the table cells within each table row.

Here's a sample of the HTML that I'm trying to parse:

<tr style="font-size:8pt;">
    <TD style="font-size:8pt;">1545644656</TD>
    <TD style="font-size:8pt;">Billy</TD>
    <TD style="font-size:8pt;">Johnson</TD>
    <TD style="font-size:8pt;">DEF</TD>

        <TD style="font-size:8pt;"></TD>
        <TD style="font-size:8pt;">1134 Main St</TD>
        <TD style="font-size:8pt;"></TD>
        <TD style="font-size:8pt;">AnyTown</TD>
        <TD style="font-size:8pt;">PA</TD>
        <TD style="font-size:8pt;">05405</TD>

</TR>

And here is the regex I'm using to get everything between the tr start and tr end:

Regex exp = new Regex("<tr style=\"font-size:8pt;\">(.*?)</TR>", RegexOptions.IgnoreCase | RegexOptions.Multiline);

I then use a foreach loop to go over all of my matches (there will be multiple rows):

foreach (Match mtch in exp.Matches(browser.Html))

But it's not matching anything. I had this exact same code working on the site before they added new lines (\n), when the page was all one long string. Now that the markup spans multiple lines, it doesn't match anything.

Any ideas here?

Parsing HTML with a regex is a bad idea. See [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/q/1732348/427192) to find out why. — Dan Pichelman, May 14 '13 at 18:36

score 2 · Answer 1 · answered May 14 '13 at 18:47


I would use a real HTML parser like HtmlAgilityPack for this:

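// Requires the HtmlAgilityPack package and: using System.Linq;
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html); // html holds the page source (browser.Html in the question)

// Collect the inner text of every <td> element in the document
var tds = doc.DocumentNode.Descendants("td")
                .Select(td => td.InnerText)
                .ToList();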

answered May 14 '13 at 18:47

I4V


This looks like a good option for future projects, but I was already almost there with what I had. Going to upvote it though, because I do think it looks like a good resource. Thanks. – Christopher Johnson May 14 '13 at 19:15

score 0 · Accepted Answer · answered May 14 '13 at 18:38


By default, . is a wildcard that matches any character except \n.

http://msdn.microsoft.com/en-us/library/az24scfc.aspx#character_classes

http://msdn.microsoft.com/en-us/library/yd1hzczs.aspx

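RegexOptions.Multiline only changes how ^ and $ match (at the start and end of each line); it does not affect the dot. I believe you need RegexOptions.Singleline instead, which makes . match \n as well.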
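As a minimal sketch of that change, assuming the row markup shown in the question: the pattern is the asker's own, with only the option swapped. The class and method names (RowMatcher, GetRows) are hypothetical and exist only for illustration.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RowMatcher
{
    // Singleline lets '.' match '\n', so (.*?) can span the line breaks inside a row.
    // IgnoreCase is kept so that the opening <tr ...> pairs with the closing </TR>.
    private static readonly Regex RowPattern = new Regex(
        "<tr style=\"font-size:8pt;\">(.*?)</TR>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the inner HTML of every row the pattern matches.
    public static List<string> GetRows(string html)
    {
        var rows = new List<string>();
        foreach (Match m in RowPattern.Matches(html))
        {
            rows.Add(m.Groups[1].Value);
        }
        return rows;
    }
}
```

Calling RowMatcher.GetRows(browser.Html) would then stand in for the loop in the question. The lazy (.*?) matters here: with Singleline on, a greedy .* would run from the first opening tr to the last closing one.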

answered May 14 '13 at 18:38

Nicole DesRosiers


I ended up just replacing all runs of whitespace with a single space character and used the Singleline option to make it work. Thanks. – Christopher Johnson May 14 '13 at 19:15
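One possible reading of that approach, as a rough sketch rather than the asker's actual code: collapse whitespace first, then match rows and pull the cells out of each row. The row pattern is the one from the question; the cell pattern and the names RowCells and Parse are assumptions made for this example.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RowCells
{
    private const RegexOptions Opts = RegexOptions.IgnoreCase | RegexOptions.Singleline;

    // Row pattern taken from the question.
    private static readonly Regex RowPattern =
        new Regex("<tr style=\"font-size:8pt;\">(.*?)</TR>", Opts);

    // Assumed cell pattern: any <td ...> opening tag, then its content, lazily.
    private static readonly Regex CellPattern =
        new Regex(@"<td\b[^>]*>(.*?)</td>", Opts);

    // Returns one list of cell strings per matched row.
    public static List<List<string>> Parse(string html)
    {
        // Collapse each run of whitespace (including \r and \n) into one space.
        string flat = Regex.Replace(html, @"\s+", " ");

        var rows = new List<List<string>>();
        foreach (Match row in RowPattern.Matches(flat))
        {
            var cells = new List<string>();
            foreach (Match cell in CellPattern.Matches(row.Groups[1].Value))
            {
                cells.Add(cell.Groups[1].Value);
            }
            rows.Add(cells);
        }
        return rows;
    }
}
```

Once the whitespace is collapsed there are no newlines left, so Singleline is no longer strictly required; it is kept here so the patterns also work on the unflattened page. Empty cells come back as empty strings, which keeps the column positions aligned across rows.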
