I need a regex pattern to match values in an HTML document. Is there an easy way to write a pattern for this?

<tr class=a1>
<td>first</td>      //name
<TD>&nbsp;</TD>
<td>256</td>        //col1
<td>06</td>     //col2
<td>102</td>        //col3
<td>03</td>     //col4
<td>503</td>        //col5
<td>189</td>        //col6
<td>174</td>        //col7
<td>46</td>     //col8 ....
</tr>
<tr class=a1>
<td>second</td>
<TD>&nbsp;</TD>
<td>07</td>
<td>211</td>
<td>27</td>
<td>450</td>
<td>111</td>
<td>203</td>
<td>147</td>
<td>65</td>
</tr>           //....

I'm new to regex and have no idea how to solve this. Here is my test code in C#:

private void butTestReg2_Click(object sender, EventArgs e)
    {
        List<CollectNumbers> LiColl = new List<CollectNumbers>();
        string html = wbrowser.DocumentText;
        string pattern = "???????";
        try
        {
            MatchCollection coll = Regex.Matches(html, pattern, RegexOptions.IgnoreCase);
            foreach (Match m in coll)
            {
                // Add one entry per match; adding after the loop would keep only the last row.
                LiColl.Add(new CollectNumbers
                {
                    name = m.Groups["name"].Value,
                    col1 = m.Groups["c1"].Value,
                    col2 = m.Groups["c2"].Value,
                    col3 = m.Groups["c3"].Value,
                    col4 = m.Groups["c4"].Value,
                    col5 = m.Groups["c5"].Value,
                    col6 = m.Groups["c6"].Value,
                    col7 = m.Groups["c7"].Value,
                    col8 = m.Groups["c8"].Value
                });
            }
        }
        catch (ArgumentException)
        {
            // Regex.Matches throws ArgumentException when the pattern is invalid.
        }
    }
    public class CollectNumbers
    {
        public string name { get; set; }
        public string col1 { get; set; }
        public string col2 { get; set; }
        public string col3 { get; set; }
        public string col4 { get; set; }
        public string col5 { get; set; }
        public string col6 { get; set; }
        public string col7 { get; set; }
        public string col8 { get; set; }
    }
Rock
  • I admit in your case a `Regex` *might* be easy to use, but you never know whether a numeral will make its way into the HTML markup itself. – pid Feb 03 '14 at 13:39
  • Given what is provided here, though, it's pretty simple to use Regex. That's all we can go on: what is available in the question. – Simon Whitehead Feb 03 '14 at 13:41
  • Yes, I agree Simon, it is very trivial. With `\d+` the problem is solved. *but...* :) you know. – pid Feb 03 '14 at 13:42

1 Answer

Don't use a regular expression where regular expressions are not the right tool!

You can't, or at least shouldn't, parse HTML with a regular expression.

Look here: Horror

As suggested in the comments, here is a link to another SO answer recommending a C# tool: HtmlAgilityPack
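
To make that pointer a bit more concrete, here is a rough sketch of how the rows from the question could be read with HtmlAgilityPack instead of a regex. It is not a definitive implementation. It assumes the third-party HtmlAgilityPack NuGet package is installed, that every row of interest is a `tr` with `class=a1`, and that the cells come in the order shown in the question (name, a spacer cell, then eight values). `RowParser` and `Parse` are hypothetical names made up for this example; `CollectNumbers` is repeated from the question so the snippet stands alone.

```csharp
// Requires the third-party HtmlAgilityPack NuGet package (an assumption of this sketch).
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Same shape as the class in the question, repeated so the sketch is self-contained.
public class CollectNumbers
{
    public string name { get; set; }
    public string col1 { get; set; }
    public string col2 { get; set; }
    public string col3 { get; set; }
    public string col4 { get; set; }
    public string col5 { get; set; }
    public string col6 { get; set; }
    public string col7 { get; set; }
    public string col8 { get; set; }
}

// Hypothetical helper, not part of HtmlAgilityPack.
public static class RowParser
{
    public static List<CollectNumbers> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<CollectNumbers>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var rows = doc.DocumentNode.SelectNodes("//tr[@class='a1']");
        if (rows == null)
            return result;

        foreach (var row in rows)
        {
            // Decode entities such as &nbsp; and trim whitespace in every cell.
            var cells = row.Elements("td")
                           .Select(td => HtmlEntity.DeEntitize(td.InnerText).Trim())
                           .ToList();

            // Assumed layout: name, spacer cell, then eight values. Skip anything else.
            if (cells.Count < 10)
                continue;

            result.Add(new CollectNumbers
            {
                name = cells[0],
                col1 = cells[2],
                col2 = cells[3],
                col3 = cells[4],
                col4 = cells[5],
                col5 = cells[6],
                col6 = cells[7],
                col7 = cells[8],
                col8 = cells[9]
            });
        }

        return result;
    }
}
```

In the click handler from the question this could be called as `var LiColl = RowParser.Parse(wbrowser.DocumentText);`. Because the rows are located by element and attribute rather than by character patterns, stray digits elsewhere in the markup should not end up in the result.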

pid
  • There is a difference between using Regex to parse/navigate HTML and ripping just numbers from some markup. The regex pattern `\d+` is sufficient here; using something like HTML Agility Pack is actually overkill in this instance. – Simon Whitehead Feb 03 '14 at 13:39
  • I agree, but I'm also quite conservative, look at my comment below yours. How can you guarantee that those cases won't happen if you can't control the remote server? – pid Feb 03 '14 at 13:41
  • @SimonWhitehead Using a tool specifically designed to solve the problem you want solved is never overkill. `\d+` is not sufficient because we aren't parsing a single line, we're parsing an entire document. OP apparently wants to pull specific columns of a specific row in a specific table on the page, and Regex is **not** the tool for navigating to such information. – nmclean Feb 03 '14 at 13:50
  • In any case, this answer would be more useful if it also contained some pointer to a tool, such as HtmlAgilityPack and maybe a small example of how to do what OP is asking with that tool. – Dmitriy Khaykin Feb 03 '14 at 14:12