How can I get the captured data in a simpler way?

Question

I have a MatchCollection, and I need the group at index 1 from each match. Right now I extract the data through a long chain of casts, which I would like to avoid.

Example: startTag = <a>, endTag = </a>, html = <a>texttexttext</a>.

I need to get "texttexttext" without the <a> and </a>.

 var regex = new Regex(startTag + "(.*?)" + endTag, RegexOptions.IgnoreCase);
 var matchCollection = regex.Matches(html);
 foreach (var item in matchCollection)
 {

      string temp = ((Match)(((Group)(item)).Captures.SyncRoot)).Groups[1].Value;
 }

Why is http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags - HTML agility pack might be better — Marc Gravell, Jul 24 '11 at 12:42
You simply must not parse html with regex. It will call forth the old gods and bring madness and destruction upon mankind: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Andreas Ågren, Jul 24 '11 at 12:43

score 2 · Answer 1 · edited May 23 '17 at 11:47

I would recommend using Html Agility Pack to parse HTML instead of regex, for various reasons.

Applied to your example, finding the text of every anchor in an HTML document looks like this:

using System;
using System.Net;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        string html = "";
        using (var client = new WebClient())
        {
            html = client.DownloadString("http://stackoverflow.com");
        }

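        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        // SelectNodes returns null (not an empty collection) when nothing matches
        var links = doc.DocumentNode.SelectNodes("//a");
        if (links == null)
        {
            return;
        }
        foreach (HtmlNode link in links)
        {
            // Will print all text contained inside all anchors 
            // on http://stackoverflow.com
            Console.WriteLine(link.InnerText);
        }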
    }
}

Any reason for the downvote? Please leave a comment when downvoting. — Darin Dimitrov, Jul 24 '11 at 13:14

score 1 · Accepted Answer · answered Jul 24 '11 at 12:58

You could use a capture group. You might also want to use a named group. Notice the parentheses in the regex, and that the loop variable is declared as Match, which is what removes the need for casts.

        var html = "<a>xx yyy</a>   <a>bbb cccc</a>";
        var startTag = "<a>";
        var endTag = "</a>";
        var regex = new Regex(startTag + "((.*?))" + endTag, RegexOptions.IgnoreCase);
        var matchCollection = regex.Matches(html);
        foreach (Match item in matchCollection)
        {
            var data = item.Groups[1];
            Console.WriteLine(data);
        }

This is a little nicer still, because a named group is easier to retrieve.

        var html = "<a>xx yyy</a>   <a>bbb cccc</a>";
        var startTag = "<a>";
        var endTag = "</a>";
        var regex = new Regex(startTag + "(?<txt>(.*?))" + endTag, RegexOptions.IgnoreCase);
        var matchCollection = regex.Matches(html);
        foreach (Match item in matchCollection)
        {
            var data = item.Groups["txt"];
            Console.WriteLine(data);
        }
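
The named-group approach above could also be wrapped in a small helper that returns the captured strings directly. This is only a sketch: TagText and Extract are hypothetical names, and two things here are assumptions that are not in the original answer. Regex.Escape is applied to the tags so that any regex metacharacters in them are matched literally, and RegexOptions.Singleline is added so that the text between the tags may span several lines.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class TagText
{
    // Returns the text found between each startTag/endTag pair in html.
    public static List<string> Extract(string html, string startTag, string endTag)
    {
        // Escape the tags so they are treated as literal text, not regex syntax.
        var pattern = Regex.Escape(startTag) + "(?<txt>.*?)" + Regex.Escape(endTag);

        // Singleline lets '.' match newlines; IgnoreCase matches <A> as well as <a>.
        var regex = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // MatchCollection is non-generic on older frameworks, so Cast<Match>()
        // gives typed access without per-item casts.
        return regex.Matches(html)
                    .Cast<Match>()
                    .Select(m => m.Groups["txt"].Value)
                    .ToList();
    }
}
```

Calling TagText.Extract(html, "<a>", "</a>") on the sample string from the answer would yield one string per anchor. The same caveat from the comments applies: this only suits simple, known markup, not general HTML.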

As it happens, I could not figure out the parser, so this is a good working variant. — Mediator, Jul 24 '11 at 13:27
