
I am trying to parse an HTML page that contains table rows. I need to get all of the table cells within each table row.

Here's a sample of the HTML that I'm trying to parse:

<tr style="font-size:8pt;">
    <TD style="font-size:8pt;">1545644656</TD>
    <TD style="font-size:8pt;">Billy</TD>
    <TD style="font-size:8pt;">Johnson</TD>
    <TD style="font-size:8pt;">DEF</TD>

        <TD style="font-size:8pt;"></TD>
        <TD style="font-size:8pt;">1134 Main St</TD>
        <TD style="font-size:8pt;"></TD>
        <TD style="font-size:8pt;">AnyTown</TD>
        <TD style="font-size:8pt;">PA</TD>
        <TD style="font-size:8pt;">05405</TD>

</TR>

And here is the regex I'm using to get everything between the tr start and tr end:

Regex exp = new Regex("<tr style=\"font-size:8pt;\">(.*?)</TR>", RegexOptions.IgnoreCase | RegexOptions.Multiline);

I then use a foreach loop to go over all of my matches (there will be multiple rows):

foreach (Match mtch in exp.Matches(browser.Html))

But it's not matching anything. This exact code worked on the site when the markup was one long single-line string. Since they added newlines (\n), it no longer matches anything.

Any ideas here?

Christopher Johnson
  • Parsing HTML with a regex is a bad idea. See [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/q/1732348/427192) to find out why. – Dan Pichelman May 14 '13 at 18:36

2 Answers


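I would use a real HTML parser like HtmlAgilityPack for this:

// Requires: using System.Linq;
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);  // html holds the page source, e.g. browser.Html

// Collect the inner text of every <td> in the document.
var tds = doc.DocumentNode.Descendants("td")
                .Select(td => td.InnerText)
                .ToList();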
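
Since the question asks for the cells within each row, the same idea can be grouped per tr. This is only a minimal sketch: it assumes the HtmlAgilityPack NuGet package is referenced, and the html string is a shortened, hypothetical stand-in for the real page source.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Hypothetical sample markup; in the question this would be browser.Html.
        string html =
            "<table>" +
            "<tr><TD>Billy</TD><TD>Johnson</TD></tr>" +
            "<tr><TD>AnyTown</TD><TD>PA</TD></tr>" +
            "</table>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Walk each row, then collect the text of the cells inside that row.
        foreach (var row in doc.DocumentNode.Descendants("tr"))
        {
            var cells = row.Descendants("td")
                           .Select(td => td.InnerText.Trim())
                           .ToList();
            Console.WriteLine(string.Join(" | ", cells));
        }
    }
}
```

Note that Descendants returns cells at any depth, so a page with nested tables would need a narrower selection.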
I4V
  • this looks like a good option for future projects, but I was already almost there with what I currently had. gonna upvote it though cause I do think it looks like a good resource...thanks. – Christopher Johnson May 14 '13 at 19:15

By default, . is a wildcard that matches any character except \n.

http://msdn.microsoft.com/en-us/library/az24scfc.aspx#character_classes

http://msdn.microsoft.com/en-us/library/yd1hzczs.aspx

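I believe you need RegexOptions.Singleline instead of RegexOptions.Multiline. Multiline only changes how ^ and $ match (at line boundaries rather than only at the ends of the input), while Singleline makes . match \n as well.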
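
To illustrate the difference, here is a small sketch that runs the pattern from the question with both options. The html string is a shortened, hypothetical stand-in for browser.Html, and the inner td pattern is an assumption added for illustration, not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // Shortened, hypothetical stand-in for browser.Html from the question.
        string html =
            "<tr style=\"font-size:8pt;\">\n" +
            "    <TD style=\"font-size:8pt;\">Billy</TD>\n" +
            "    <TD style=\"font-size:8pt;\">Johnson</TD>\n" +
            "</TR>\n";

        string pattern = "<tr style=\"font-size:8pt;\">(.*?)</TR>";

        // Multiline only affects ^ and $, so . still stops at \n.
        var multiline = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Multiline);

        // Singleline lets . match \n, so the lazy group can span the whole row.
        var singleline = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);

        Console.WriteLine(multiline.Matches(html).Count);   // prints 0
        Console.WriteLine(singleline.Matches(html).Count);  // prints 1

        // Illustrative cell pattern (an assumption): text between <td ...> and </td>.
        var cell = new Regex("<td[^>]*>(.*?)</td>", RegexOptions.IgnoreCase | RegexOptions.Singleline);

        foreach (Match row in singleline.Matches(html))
        {
            foreach (Match td in cell.Matches(row.Groups[1].Value))
            {
                Console.WriteLine(td.Groups[1].Value.Trim());  // prints Billy, then Johnson
            }
        }
    }
}
```

This keeps the regex approach from the question and only swaps the option, which is the smallest change that makes the multi-line rows match again.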