
I wrote this pattern to match nested divs:

(<div[^>]*>(?:\g<1>|.)*?<\/div>)

This works nicely, as you can see in regex101.

However, when I write the code below in C#:

Regex findDivs = new Regex("(<div[^>]*>(?:\\g<1>|.)*?<\\/div>)", RegexOptions.Singleline);

It throws an error:

Additional information: 
    parsing "(<div[^>]*>(?:\g<1>|.)*?<\/div>)" - 
        Unrecognized escape sequence \g.

As you can see, \g doesn't work in C#. How can I match the first subpattern recursively then?

  • I like the top answer on this question about attempting to match HTML using regex: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags In short, `Don't Parse HTML With Regex` – austin wernli May 24 '16 at 18:30
  • First off, you really should use a regular expression tester that specifically targets C# to ensure compatibility. Second, check out this question: http://stackoverflow.com/questions/19596502/regex-nested-parentheses – juharr May 24 '16 at 19:13

2 Answers


The .NET regex engine does not support recursion or subroutine calls such as \g<1>. What you are looking for is balancing groups. Here is a .NET equivalent of your regex:

(?sx)<div[^>]*>                   # Opening DIV
    (?>                           # Start of atomic group
        (?:(?!</?div[^>]*>).)+    # (1) Any text other than open/close DIV
        |   <div[^>]*> (?<tag>)   # Add 1 "tag" value to stack if opening DIV found 
        |   </div> (?<-tag>)      # Remove 1 "tag" value from stack when closing DIV tag is found
    )*
    (?(tag)(?!))                  # Check if "tag" stack is not empty (then fail)
</div>

See the regex demo
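
For completeness, here is a minimal sketch of how the pattern above might be used from C#. The class name, field name and sample input are made up for illustration, and the inline (?sx) flags are expressed as RegexOptions instead. The pattern comments avoid double quotes so that the verbatim string needs no escaping.

```csharp
using System;
using System.Text.RegularExpressions;

class BalancedDivDemo
{
    // Same balancing-group pattern as above; Singleline replaces (?s),
    // IgnorePatternWhitespace replaces (?x).
    internal static readonly Regex TopLevelDiv = new Regex(@"
        <div[^>]*>                        # opening DIV
        (?>
            (?:(?!</?div[^>]*>).)+        # text that is not a DIV tag
            | <div[^>]*> (?<tag>)         # nested opening DIV: push onto the stack
            | </div> (?<-tag>)            # nested closing DIV: pop from the stack
        )*
        (?(tag)(?!))                      # fail if any nested DIV is still open
        </div>",
        RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);

    static void Main()
    {
        // Hypothetical sample input with one nested DIV and one flat DIV
        string html = "<div id='a'>x<div>y</div>z</div><p>skip</p><div id='b'>w</div>";

        foreach (Match m in TopLevelDiv.Matches(html))
            Console.WriteLine(m.Value);
        // <div id='a'>x<div>y</div>z</div>
        // <div id='b'>w</div>
    }
}
```

Each match is one outermost DIV together with everything nested inside it. Unbalanced markup simply produces no match at that position.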

However, you might really want to use HtmlAgilityPack to parse HTML.

The main point is to get an XPath that will match all DIV tags that have no ancestors with the same name. You might want something like this (untested):

// Requires: using System; using System.Collections.Generic; using System.Linq;
private List<string> GetTopmostDivs(string html)
{
    HtmlAgilityPack.HtmlDocument hap;
    Uri uriResult;
    if (Uri.TryCreate(html, UriKind.Absolute, out uriResult)
        && (uriResult.Scheme == Uri.UriSchemeHttp || uriResult.Scheme == Uri.UriSchemeHttps))
    { // html is a URL: download and parse the page
        var web = new HtmlAgilityPack.HtmlWeb();
        hap = web.Load(uriResult.AbsoluteUri);
    }
    else
    { // html is a string of markup
        hap = new HtmlAgilityPack.HtmlDocument();
        hap.LoadHtml(html);
    }
    // Select every DIV that has no DIV ancestor, i.e. the topmost ones
    var nodes = hap.DocumentNode.SelectNodes("//div[not(ancestor::div)]");
    // SelectNodes returns null when nothing matches
    if (nodes != null)
        return nodes.Select(p => p.OuterHtml).ToList();
    else
        return new List<string>();
}
Wiktor Stribiżew

What you want to do is iterate over the capture groups. Here is an example:

// `test` is a collection of input strings and `regex` is a Regex, both defined elsewhere
foreach (var s in test)
{
    Match match = regex.Match(s);

    foreach (Capture capture in match.Captures)
    {
        Console.WriteLine("Index={0}, Value={1}", capture.Index, capture.Value);
        Console.WriteLine(match.Groups[1].Value);
    }
}
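
To make the difference between a group's value and its captures concrete, here is a small self-contained sketch. The pattern and input are hypothetical and deliberately simple (flat, non-nested divs); it only illustrates how Groups and Captures behave in .NET and is not a solution to the nesting problem.

```csharp
using System;
using System.Text.RegularExpressions;

class CaptureDemo
{
    // Hypothetical pattern: group 1 sits inside a repeated group,
    // so it records one capture per repetition.
    internal static readonly Regex Divs = new Regex(@"(?:<div>(\w+)</div>)+");

    static void Main()
    {
        Match match = Divs.Match("<div>a</div><div>b</div>");

        // Groups[1].Value holds only the last capture of the group
        Console.WriteLine(match.Groups[1].Value);   // b

        // Groups[1].Captures holds every capture, in order
        foreach (Capture capture in match.Groups[1].Captures)
            Console.WriteLine("Index={0}, Value={1}", capture.Index, capture.Value);
        // Index=5, Value=a
        // Index=17, Value=b
    }
}
```

Note that match.Captures on the Match object itself only reports the capture of the whole match (group 0); the per-group history lives in match.Groups[n].Captures.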
  • Sorry if I'm lazy.. how exactly can it help? – Washington Guedes May 24 '16 at 21:13
  • You can do the match and then you can look at the capture groups. You will see that one of them contains the values that you want. Fire up that debugger and take a look. –  May 24 '16 at 21:56