1

I have an HTML file that I'm trying to extract data from. The regex I'm using is

"<tr.+?>.+?<td class=\"table_row_col2\"><b>(.+?)&.+?</b>.+?<td class=\"table_row_col5\">(.+?)</td>.+?<td class=\"table_row_col6\">(.+?)</td>.+?</tr>"

It works in Python but not in C#. Here's some sample data:

<tr class="table_row" style="background-color: #d3d3d3;">
    <td class="table_row_col1">271</td>
    <td class="table_row_col2"><b>16/09/2015&nbsp;05:28&nbsp;PM</b></font></small></sup></td>
    <td class="table_row_col3"><span style="color:#e30613">14.3</span></td>
    <td class="table_row_col4">-</td>
    <td class="table_row_col5">8</td>
    <td class="table_row_col6">-</td>
    <td class="table_row_col7">-</td>
    <td class="table_row_col8">Before dinner</td>
    <td class="table_row_col9">-</td>
    <td class="table_row_col10">-</td>
    <td class="table_row_col11">-</td>
</tr>

<tr class="table_row" style="background-color: #ffffff;">
    <td class="table_row_col1">272</td>
    <td class="table_row_col2"><b>16/09/2015&nbsp;02:54&nbsp;PM</b></font></small></sup></td>
    <td class="table_row_col3"><span style="color:#e30613">17.6</span></td>
    <td class="table_row_col4">-</td>
    <td class="table_row_col5">20</td>
    <td class="table_row_col6">32</td>
    <td class="table_row_col7">-</td>
    <td class="table_row_col8">Other</td>
    <td class="table_row_col9">-</td>
    <td class="table_row_col10">-</td>
    <td class="table_row_col11">-</td>
</tr>

<tr class="table_row" style="background-color: #d3d3d3;">
    <td class="table_row_col1">273</td>
    <td class="table_row_col2"><b>15/09/2015&nbsp;11:09&nbsp;PM</b></font></small></sup></td>
    <td class="table_row_col3">-</td>
    <td class="table_row_col4">-</td>
    <td class="table_row_col5">-</td>
    <td class="table_row_col6">34</td>
    <td class="table_row_col7">-</td>
    <td class="table_row_col8">Before Bed</td>
    <td class="table_row_col9">-</td>
    <td class="table_row_col10">-</td>
    <td class="table_row_col11">-</td>
</tr>

I'm trying to extract the date from table_row_col2 and the numbers from table_row_col5 and table_row_col6.
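For reference, here is a minimal sketch of how the pattern above could be applied in C#. It rests on an assumption the question does not confirm: that the Python version was run with `re.DOTALL`. In .NET, `.` does not match a newline unless `RegexOptions.Singleline` is set, and `Match.Value` holds the whole matched text, so the three captured values have to be read from `Groups[1]` to `Groups[3]`. The class and method names (`RowRegexSketch`, `ExtractRows`) are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RowRegexSketch
{
    // Same pattern as in the question, written as verbatim strings ("" is an escaped quote).
    private const string Pattern =
        @"<tr.+?>.+?<td class=""table_row_col2""><b>(.+?)&.+?</b>" +
        @".+?<td class=""table_row_col5"">(.+?)</td>" +
        @".+?<td class=""table_row_col6"">(.+?)</td>.+?</tr>";

    // Returns one string[] { date, col5, col6 } per matched table row.
    public static List<string[]> ExtractRows(string html)
    {
        var rows = new List<string[]>();

        // Singleline lets '.' match '\n', so a single match can span a multi-line <tr>.
        foreach (Match m in Regex.Matches(html, Pattern, RegexOptions.Singleline))
        {
            // Groups[0] is the whole match; the captures start at index 1.
            rows.Add(new[] { m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value });
        }

        return rows;
    }

    public static void Main()
    {
        // One trimmed row from the sample data in the question.
        string html = @"<tr class=""table_row"" style=""background-color: #d3d3d3;"">
    <td class=""table_row_col1"">271</td>
    <td class=""table_row_col2""><b>16/09/2015&nbsp;05:28&nbsp;PM</b></font></small></sup></td>
    <td class=""table_row_col5"">8</td>
    <td class=""table_row_col6"">-</td>
</tr>";

        foreach (string[] row in ExtractRows(html))
        {
            Console.WriteLine(string.Join(", ", row)); // prints "16/09/2015, 8, -"
        }
    }
}
```

Because the first group stops at the first `&`, it captures only the date part (`16/09/2015`), not the time that follows the `&nbsp;`.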

  • 3
    I see you are new here - welcome to SO! And the first thing you should know is that HTML parsing is best done with HTML parsers, not with regex. Have you considered using one? Like [HtmlAgilityPack](https://htmlagilitypack.codeplex.com/), etc.? Every time someone posts a question about parsing HTML with regex, [*RegEx match open tags except XHTML self-contained tags*](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) link is shared :) – Wiktor Stribiżew Jan 28 '16 at 12:55
  • 2
    what is it returning in C# at the moment? – Rhumborl Jan 28 '16 at 12:55
  • How does it not work? Is there an exception, or not returning the right value? – Hayden Jan 28 '16 at 12:55
  • looks like a job for https://htmlagilitypack.codeplex.com/ – fubo Jan 28 '16 at 12:55
  • 2
    C# is returning all the HTML. And no, I didn't try an HTML parser; I didn't really know you could extract values with one. I'll see if I can use one and get it to work. Thanks for the replies. –  Jan 28 '16 at 13:00

1 Answer

1

If you know the HTML never changes, you can do it like this, using an additional Split class:

// Each element is the text between class="table_row" and the next </tr>.
List<string> rows = Split.Extract(htmlString, "class=\"table_row\"", "</tr>");
foreach (string row in rows)
{
    // [0] takes the first hit; it throws if the marker does not occur in the row.
    string col2 = Split.Extract(row, "class=\"table_row_col2\"><b>", "</b>")[0];
    string col5 = Split.Extract(row, "class=\"table_row_col5\">", "</td>")[0];
    string col6 = Split.Extract(row, "class=\"table_row_col6\">", "</td>")[0];

    Console.WriteLine(col2 + ", " + col5 + ", " + col6);
}

The additional Split class:

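using System;
using System.Collections.Generic;

public class Split
{
    // Returns every substring of source that lies between splitStart and the next splitEnd.
    public static List<string> Extract(string source, string splitStart, string splitEnd)
    {
        var results = new List<string>();

        string[] start = { splitStart };
        string[] end = { splitEnd };
        string[] temp = source.Split(start, StringSplitOptions.None);

        // temp[0] is the text before the first start marker, so skip it.
        for (int i = 1; i < temp.Length; i++)
        {
            // Keep only what precedes the first end marker in this piece.
            results.Add(temp[i].Split(end, StringSplitOptions.None)[0]);
        }

        // No try/catch here: rethrowing as new Exception(e.Message) would discard
        // the original exception type and stack trace.
        return results;
    }
}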
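As a possible follow-up step, the strings returned by `Split.Extract` could be converted into typed values. The sketch below is not part of the Split class above. It assumes that the timestamps are day-first (`dd/MM/yyyy hh:mm tt`, as `16/09/2015` suggests) and that `-` marks an empty cell. The names `CellParsingSketch`, `ParseTimestamp` and `ParseCount` are hypothetical.

```csharp
using System;
using System.Globalization;
using System.Net;

public static class CellParsingSketch
{
    // Hypothetical helper: turns "16/09/2015&nbsp;05:28&nbsp;PM" into a DateTime.
    public static DateTime ParseTimestamp(string cell)
    {
        // HtmlDecode converts &nbsp; to U+00A0; swap it for a plain space before parsing.
        string text = WebUtility.HtmlDecode(cell).Replace('\u00A0', ' ');

        // Assumed format: day first, 12-hour clock with AM/PM designator.
        return DateTime.ParseExact(text, "dd/MM/yyyy hh:mm tt", CultureInfo.InvariantCulture);
    }

    // Hypothetical helper: returns null for "-" (assumed to mean "no value")
    // or for anything else that is not an integer.
    public static int? ParseCount(string cell)
    {
        int value;
        bool ok = int.TryParse(cell.Trim(), NumberStyles.Integer, CultureInfo.InvariantCulture, out value);
        return ok ? value : (int?)null;
    }

    public static void Main()
    {
        DateTime when = ParseTimestamp("16/09/2015&nbsp;05:28&nbsp;PM");
        Console.WriteLine(when.ToString("yyyy-MM-dd HH:mm", CultureInfo.InvariantCulture)); // prints "2015-09-16 17:28"

        Console.WriteLine(ParseCount("8"));           // prints "8"
        Console.WriteLine(ParseCount("-").HasValue);  // prints "False"
    }
}
```

Using `int?` keeps a missing reading distinguishable from a real zero, which matters if the numbers are later summed or averaged.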
M. Schena
  • 1
    Even better than the old methods I was trying to use: it cut the operation time from 4.5 seconds (or about 4.75 - 5 seconds with my very first method) to ~450ms. You are a godsend. Thank you very much. –  Feb 19 '16 at 12:03