<table >
    <tr>
        <td colspan="2" style="height: 14px">
            tdtext1
            <a>hyperlinktext1<a/> 
        </td>
    </tr>
    <tr>
        <td>
            tdtext2
        </td>
        <td>
            <span>spantext1</span>
        </td>
    </tr>
</table>   

This is my sample HTML. How can I write a regular expression in C# that matches the inner text of the td, span, and hyperlink (a) elements?

Amirhossein Mehrvarzi
mushtaqck

3 Answers


I cringe every time I hear the words regex and HTML in the same sentence. I would suggest checking out the HtmlAgilityPack on CodePlex, a very tolerant HTML parser that lets you run XPath queries against the parsed document. It's much cleaner, and the person who inherits your code will thank you!

EDIT

As per the comments below, here are some examples of how to get the InnerText of those tags. Very simple.

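var doc = new HtmlDocument();
doc.LoadHtml("...your sample html...");

// SelectNodes returns null (not an empty collection) when nothing matches,
// so fall back to an empty sequence before iterating (needs System.Linq).

// all <td> tags in the document
foreach (HtmlNode td in doc.DocumentNode.SelectNodes("//td") ?? Enumerable.Empty<HtmlNode>()) {
    Console.WriteLine(td.InnerText);
}

// all <span> tags in the document
foreach (HtmlNode span in doc.DocumentNode.SelectNodes("//span") ?? Enumerable.Empty<HtmlNode>()) {
    Console.WriteLine(span.InnerText);
}

// all <a> tags in the document
foreach (HtmlNode a in doc.DocumentNode.SelectNodes("//a") ?? Enumerable.Empty<HtmlNode>()) {
    Console.WriteLine(a.InnerText);
}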
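
The three loops can also be collapsed into a single query with the XPath union operator. The following is a minimal sketch rather than part of the original answer: the class name `InnerTextSketch`, the helper `ExtractTexts`, and the inline sample HTML are hypothetical names chosen for illustration, and it assumes the HtmlAgilityPack package is referenced.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumes the HtmlAgilityPack package is referenced

public static class InnerTextSketch
{
    // Hypothetical helper: returns the trimmed InnerText of every <td>, <span> and <a>.
    public static List<string> ExtractTexts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var texts = new List<string>();

        // "|" is the XPath union operator, so one query covers all three tag names.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//td | //span | //a");
        if (nodes == null)
        {
            return texts; // SelectNodes returns null when nothing matches
        }

        foreach (HtmlNode node in nodes)
        {
            texts.Add(node.InnerText.Trim());
        }
        return texts;
    }

    public static void Main()
    {
        // Simplified, well-formed stand-in for the sample table in the question.
        const string html =
            "<table><tr><td>tdtext2</td><td><span>spantext1</span></td></tr></table>";

        foreach (string text in ExtractTexts(html))
        {
            Console.WriteLine(text);
        }
    }
}
```

Note that a `<td>`'s InnerText includes the text of any elements nested inside it, so in this sketch "spantext1" is reported both for the `<span>` and for the `<td>` that contains it.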
carla
Josh
  • Can you help me with the XPath queries for the above parsing requirement? – mushtaqck May 20 '10 at 06:41
  • I added a code example. I don't know how complex your XPath requirements are but I guarantee you it'll be much easier with XPath than Regex. – Josh May 20 '10 at 06:51
  • Lucky you only cringe, unlike this guy, who totally lost it. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Igor Zevaka May 20 '10 at 06:53
  • @Igor, oh my god that is hilarious. I bet a hooker got murdered that night. – Josh May 20 '10 at 06:55
static void Main(string[] args)
{
    //...
    // HtmlWeb downloads and parses the page in one step, so no WebClient is needed.
    HtmlDocument doc = new HtmlWeb().Load("http://www.freeclup.com");

    // SelectNodes returns null when no <span> is found, so guard before iterating.
    HtmlNodeCollection spans = doc.DocumentNode.SelectNodes("//span");
    if (spans != null)
    {
        foreach (HtmlNode span in spans)
        {
            Console.WriteLine(span.InnerText);
        }
    }
    Console.ReadKey();
}
MAG TOR

You could use something like:

        const string pattern = @"[a|span|td]>\s*?(?<text>\w+?)\s*?</\w+>";
        Regex regex = new Regex(pattern, RegexOptions.Singleline);
        MatchCollection m = regex.Matches(x);
        List<string> list = new List<string>();

        foreach (Match match in m)
        {
            list.Add(match.Groups["text"].Value);
        }
Some User
  • -1: you didn't try this, did you? You also didn't get the point that, in general, you cannot use regular expressions against HTML. – John Saunders May 21 '10 at 03:10
  • Did you realize that HTML is not a regular language, so regular expressions don't work in all cases? And, BTW, the `[abc]` syntax means, `a` or `b` or `c`, and so does `[a-c]`. You meant to use parentheses there: `(a | span | td)` is what you wanted. – John Saunders May 21 '10 at 07:31
  • The XPath example could easily be reduced to one loop using the same alternation expression. I believe you also misspelled douche. – Josh May 21 '10 at 11:44
  • Parse [X]HTML with regex? Blasphemy! Read: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Marcel Valdez Orozco Jan 06 '12 at 08:47
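
For reference, the alternation the comments describe could look like the sketch below. It is an illustration only, not the answerer's code and not a recommendation: the class name `RegexSketch` and the helper `InnerTexts` are hypothetical, and the pattern deliberately handles only elements whose content is plain text with no nested tags. Anything more complicated runs into the limits the commenters point out, which is why a parser is the safer choice.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexSketch
{
    // (a|span|td) is an alternation group; [a|span|td] would be a character class
    // matching any ONE of the listed characters.
    // \k<tag> requires the closing tag to carry the same name as the opening tag.
    // [^<]+? limits the match to plain text, so elements with nested tags are skipped.
    private static readonly Regex Pattern = new Regex(
        @"<(?<tag>a|span|td)\b[^>]*>\s*(?<text>[^<]+?)\s*</\k<tag>>",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: collects the "text" group of every match.
    public static List<string> InnerTexts(string html)
    {
        var list = new List<string>();
        foreach (Match match in Pattern.Matches(html))
        {
            list.Add(match.Groups["text"].Value);
        }
        return list;
    }

    public static void Main()
    {
        // Simplified stand-in for part of the sample table in the question.
        const string html = "<td> tdtext2 </td><td><span>spantext1</span></td>";

        foreach (string text in InnerTexts(html))
        {
            Console.WriteLine(text);
        }
    }
}
```

On this input the sketch yields "tdtext2" and "spantext1"; the `<td>` that wraps the `<span>` is not matched at all, because its content is not plain text. That gap is one concrete example of where a regex-based approach falls short of an HTML parser.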