.NET regex inner text between td, span, a tag

Question

<table >
    <tr>
        <td colspan="2" style="height: 14px">
            tdtext1
            <a>hyperlinktext1</a>
        </td>
    </tr>
    <tr>
        <td>
            tdtext2
        </td>
        <td>
            <span>spantext1</span>
        </td>
    </tr>
</table>

This is my sample text. How can I write a regular expression in C# that matches the inner text of the td, span, and hyperlink (a) elements?

Regular expressions are part of the .NET Framework. They work the same for C#, VB.NET, F#, or any .NET language. — John Saunders, May 21 '10 at 03:08

score 8 · Accepted Answer · edited Nov 25 '17 at 14:46

I cringe every time I hear the words regex and HTML in the same sentence. I would suggest checking out the HtmlAgilityPack on CodePlex, a very tolerant HTML parser that lets you run XPath queries against the parsed document. It's much cleaner, and the person who inherits your code will thank you!

EDIT

As per the comments below, here are some examples of how to get the InnerText of those tags. Very simple.

var doc = new HtmlDocument();
doc.LoadHtml("...your sample html...");

// all <td> tags in the document
foreach (HtmlNode td in doc.DocumentNode.SelectNodes("//td")) {
    Console.WriteLine(td.InnerText);
}

// all <span> tags in the document
foreach (HtmlNode span in doc.DocumentNode.SelectNodes("//span")) {
    Console.WriteLine(span.InnerText);
}

// all <a> tags in the document
foreach (HtmlNode a in doc.DocumentNode.SelectNodes("//a")) {
    Console.WriteLine(a.InnerText);
}
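
The three loops above can also be collapsed into a single query with the XPath union operator (`|`). The following is only a sketch, not part of the original answer: the class name `InnerTextDemo` and the inline HTML string (a well-formed version of the question's sample) are assumptions made for illustration. It also guards against `SelectNodes` returning null, which it does when nothing matches.

```csharp
using System;
using HtmlAgilityPack;

class InnerTextDemo
{
    static void Main()
    {
        // Hypothetical input: a well-formed version of the table from the question.
        const string html =
            "<table><tr><td colspan=\"2\">tdtext1 <a>hyperlinktext1</a></td></tr>" +
            "<tr><td>tdtext2</td><td><span>spantext1</span></td></tr></table>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // The XPath union operator (|) selects <td>, <span> and <a> in one query.
        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//td | //span | //a");
        if (nodes == null)
        {
            Console.WriteLine("No matching elements.");
            return;
        }

        foreach (HtmlNode node in nodes)
        {
            // InnerText includes the text of descendant elements as well,
            // so a <td> that wraps an <a> also reports the link text.
            Console.WriteLine("{0}: {1}", node.Name, node.InnerText.Trim());
        }
    }
}
```

If only the text sitting directly inside an element is wanted, without the text of nested tags, a query such as `//td/text()` is one option to try.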

edited Nov 25 '17 at 14:46 by carla

answered May 20 '10 at 06:38 by Josh

Can you help me with the XPath queries for the above parsing requirement? – mushtaqck May 20 '10 at 06:41
I added a code example. I don't know how complex your XPath requirements are but I guarantee you it'll be much easier with XPath than Regex. – Josh May 20 '10 at 06:51
Lucky you only cringe, unlike this guy, who totally lost it. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Igor Zevaka May 20 '10 at 06:53
@Igor, oh my god that is hilarious. I bet a hooker got murdered that night. – Josh May 20 '10 at 06:55

score 1 · Answer 2 · answered Feb 13 '13 at 10:38

static void Main(string[] args)
{
    //...
    // HtmlWeb downloads and parses the page in one step, so no WebClient is needed.
    HtmlDocument doc = new HtmlWeb().Load("http://www.freeclup.com");

    // Print the inner text of every <span> in the document.
    foreach (HtmlNode span in doc.DocumentNode.SelectNodes("//span"))
    {
        Console.WriteLine(span.InnerText);
    }
    Console.ReadKey();
}

score 0 · Answer 3 · answered May 21 '10 at 03:06

You could use something like:

        const string pattern = @"[a|span|td]>\s*?(?<text>\w+?)\s*?</\w+>";
        Regex regex = new Regex(pattern, RegexOptions.Singleline);
        MatchCollection m = regex.Matches(x);
        List<string> list = new List<string>();

        foreach (Match match in m)
        {
            list.Add(match.Groups["text"].Value);
        }
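
For comparison, here is a sketch of what a pattern like the one above may have been aiming for. It uses a group alternation, `(?:a|span|td)`, where the original has the character class `[a|span|td]`. It assumes simple, well-formed markup like the question's sample, with no `>` inside attribute values, no comments and no scripts. It is not a general HTML parser. The names `InnerTextRegexSketch` and `ExtractLeadingText` are made up for illustration, and only the text directly after each opening tag is captured.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class InnerTextRegexSketch
{
    // (?:a|span|td) is a group alternation: the tag name "a", "span" or "td".
    // \b stops "a" from matching the start of longer names such as "abbr".
    // The "text" group captures the run of text directly after the opening tag,
    // starting at the first non-whitespace character and ending before the next '<'.
    private static readonly Regex LeadingText = new Regex(
        @"<(?:a|span|td)\b[^>]*>\s*(?<text>[^<\s][^<]*?)\s*(?=<)",
        RegexOptions.IgnoreCase);

    public static List<string> ExtractLeadingText(string html)
    {
        var results = new List<string>();
        foreach (Match match in LeadingText.Matches(html))
        {
            results.Add(match.Groups["text"].Value);
        }
        return results;
    }

    static void Main()
    {
        // Hypothetical input: a well-formed version of the table from the question.
        const string html =
            "<table><tr><td colspan=\"2\">tdtext1 <a>hyperlinktext1</a></td></tr>" +
            "<tr><td>tdtext2</td><td><span>spantext1</span></td></tr></table>";

        foreach (string text in ExtractLeadingText(html))
        {
            Console.WriteLine(text);
        }
    }
}
```

On this input the sketch prints tdtext1, hyperlinktext1, tdtext2 and spantext1, one per line. The `<td>` that contains only a `<span>` yields nothing of its own, because no text follows its opening tag directly. Anything less regular than this sample is where the approach starts to break down.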

answered May 21 '10 at 03:06 by Some User

-1: you didn't try this, did you? You also didn't get the point that, in general, you cannot use regular expressions against HTML. – John Saunders May 21 '10 at 03:10
Did you realize that HTML is not a regular language, so regular expressions don't work in all cases? And, BTW, the `[abc]` syntax means `a` or `b` or `c`, as does `[a-c]`. You meant to use parentheses there: `(a|span|td)` is what you wanted. – John Saunders May 21 '10 at 07:31
The XPath example could easily be reduced to one loop using the same alternation expression. I believe you also misspelled douche. – Josh May 21 '10 at 11:44
Parse [X]HTML with regex? Blasphemy! Read: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Marcel Valdez Orozco Jan 06 '12 at 08:47
