Need help. Why do I get an "ArgumentException was unhandled" error? The message says "Unrecognized grouping construct". Is my pattern wrong?

   WebClient client = new WebClient();
            string contents = client.DownloadString("http://site.com");

                string pattern =@"<td>\s*(?<no>\d+)\.\s*</td>\s*<td>\s*
                        <a class=""LN"" href=""[^""]*+"" 
                        onclick=""[^""]*+"">\s*+<b>(?<name>[^<]*+)
                        </b>\s*+</a>.*\s*</td>\s*+ 
                        <td align=""center"">[^<]*+</td>
                        \s*+<td>\s*+(?<locations>(?:<a href=""[^""]*+"">[^<]*+</a><br />\s*+)++)</td>";

            foreach (Match match in Regex.Matches(contents, pattern, RegexOptions.IgnoreCase))
            {
                string no = match.Groups["no"].Value;
                string name = match.Groups["name"].Value;
                string locations = match.Groups["locations"].Value;

                Console.WriteLine(no+" "+name+" "+locations);
            }
Cindy93
  • Fun fact: Using verbatim string literals allows you to span your string across multiple lines. You don't need to keep concatenating strings on each line. – Dave Zych Oct 25 '13 at 03:35
  • Regex is not meant for parsing HTML. Use an HTML parser like HtmlAgilityPack! There are thousands of cases where this code will break. Please don't use regex. – Anirudha Oct 25 '13 at 03:38
  • The obligatory [link](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) against parsing HTML with regex – Mark Hall Oct 25 '13 at 03:42
  • Can you give me a link, @Anirudh? – Cindy93 Oct 25 '13 at 04:00
  • You could start with the HtmlAgilityPack on Codeplex http://htmlagilitypack.codeplex.com/ – Mark Hall Oct 25 '13 at 04:10

1 Answer

There's no such thing as ?P<name> in C#/.NET. The equivalent syntax is just ?<name>.

The P named group syntax is from PCRE/Python (and Perl allows it as an extension).
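As a minimal sketch of the .NET form (the class name and sample input are made up for illustration), the following captures a number with (?<no>...) and reads it back by name:

```csharp
using System;
using System.Text.RegularExpressions;

class NamedGroupDemo
{
    static void Main()
    {
        // .NET named group: (?<no>...). The Python/PCRE spelling (?P<no>...)
        // is rejected here with "Unrecognized grouping construct".
        Match m = Regex.Match("<td> 12. </td>", @"<td>\s*(?<no>\d+)\.\s*</td>");

        // Named groups are read back through the Groups indexer.
        Console.WriteLine(m.Groups["no"].Value); // prints 12
    }
}
```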

You'll also need to remove all the possessive quantifiers (i.e. change *+ to * and ++ to +). .NET doesn't support them and reports them as nested quantifiers. If you want to get the exact same behavior you can switch X*+ to the atomic group (?>X*), and likewise with ++.
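A small sketch of that rewrite, assuming a made-up one-line input: the *+ form is rejected when the pattern is parsed, while the (?>X*) form is accepted and matches without backtracking into the group.

```csharp
using System;
using System.Text.RegularExpressions;

class AtomicGroupDemo
{
    static void Main()
    {
        try
        {
            // [^"]*+ is a possessive quantifier in PCRE/Java; .NET refuses to parse it.
            new Regex(@"href=""[^""]*+""");
        }
        catch (ArgumentException)
        {
            Console.WriteLine("possessive quantifier rejected");
        }

        // Same intent written with atomic groups, which .NET does support.
        string html = "<a href=\"page.html\">Home</a>"; // hypothetical sample input
        Match m = Regex.Match(html, @"<a href=""(?>[^""]*)"">(?<text>(?>[^<]*))</a>");
        Console.WriteLine(m.Groups["text"].Value); // prints Home
    }
}
```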

Here is your regex, modified. I've tried to comment it a bit too, but I can't guarantee I did so without breaking it.

new Regex(
@"<td>                   # a td element
    \s*(?<no>\d+)\.\s*   # containing a number captured as 'no'
  </td>\s*
  <td>\s*                # followed by another td, containing
                         # an <a class=... href=... onclick=...> exactly
                         # (literal spaces are escaped as '\ ' because
                         # IgnorePatternWhitespace drops unescaped ones)
      <a\ class=""LN""\ href=""(?>[^""]*)""\ onclick=""(?>[^""]*)"">
         (?>\s*)                   # which contains
         <b>(?<name>(?>[^<]*))</b> # some text in bold captured as 'name'
         (?>\s*)
      </a>
      .*                 # then anything up to the end of the line
                         # ('.' does not match newlines without RegexOptions.Singleline)
      \s*
  </td>                  # the end of a td, followed by whitespace
  (?>\s*)
  <td\ align=""center"">  # then a <td align=center> containing no other elements
    (?>[^<]*)
  </td>
  (?>\s*)
  <td>                   # lastly
    (?>\s*)
    (?<locations>        # a series of <a href=...>...</a><br />
        (?>(?:           # captured as 'locations'
            <a\ href=""(?>[^""]*)"">(?>[^<]*)</a>
            <br\ />
            (?>\s*)
            )
        +))              # (containing at least one of these)
  </td>", RegexOptions.IgnorePatternWhitespace|RegexOptions.IgnoreCase)

But you really should use something like the HTML Agility Pack.
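For comparison, here is a rough sketch of the same extraction with the HTML Agility Pack (NuGet package HtmlAgilityPack). The sample markup, the class name, and the assumption that the cells sit in a tr in this order are all hypothetical; they are guessed from what the regex expects, not taken from the real page.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

class HapSketch
{
    static void Main()
    {
        // Hypothetical markup modeled on the structure the regex looks for.
        string html = @"<table><tr>
            <td>1.</td>
            <td><a class=""LN"" href=""#"" onclick=""x()""><b>Acme</b></a></td>
            <td align=""center"">-</td>
            <td><a href=""/a"">Paris</a><br /><a href=""/b"">Rome</a><br /></td>
          </tr></table>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches, so guard against that.
        var rows = doc.DocumentNode.SelectNodes("//tr[td/a[@class='LN']]");
        if (rows == null) return;

        foreach (HtmlNode row in rows)
        {
            var cells = row.SelectNodes("td");
            if (cells == null || cells.Count < 4) continue; // skip rows with a different layout

            string no = cells[0].InnerText.Trim().TrimEnd('.');
            string name = row.SelectSingleNode("td/a[@class='LN']/b")?.InnerText.Trim() ?? "";

            // Collect the link texts in the fourth cell instead of the raw markup.
            var locations = cells[3].SelectNodes("a")?.Select(a => a.InnerText.Trim())
                            ?? Enumerable.Empty<string>();

            Console.WriteLine(no + " " + name + " " + string.Join(", ", locations));
        }
    }
}
```

The XPath expressions would need adjusting to the real page. The point is that the parser copes with attribute order, whitespace, and line breaks that the regex has to spell out by hand.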

porges
  • Thanks @Porges :) The error is gone, but I don't get any results. I think the problem now is in my regex, like Mark Hall said. – Cindy93 Oct 25 '13 at 04:02