Extract the contents of a string between two string delimiters using match in C#

Question

So, say I'm parsing the following HTML string:

<html>
    <head>
        RANDOM JAVASCRIPT AND CSS AHHHHHH!!!!!!!!
    </head>
    <body>
        <table class="table">
            <tr><a href="/subdir/members/Name">Name</a></tr>
            <tr><a href="/subdir/members/Name">Name</a></tr>
            <tr><a href="/subdir/members/Name">Name</a></tr>
            <tr><a href="/subdir/members/Name">Name</a></tr>
            <tr><a href="/subdir/members/Name">Name</a></tr>
            <tr><a href="/subdir/members/Name">Name</a></tr>
            <tr><a href="/subdir/members/Name">Name</a></tr>
            <tr><a href="/subdir/members/Name">Name</a></tr>
            <tr><a href="/subdir/members/Name">Name</a></tr>
            <tr><a href="/subdir/members/Name">Name</a></tr>
        </table>
    </body>
</html>

and I want to isolate the contents of `<table class="table">` (everything inside the table element).

Now, I used regex to accomplish this:

string pageSource = /* method that extracts the HTML source and stores it in a string */;
string[] splitSource = Regex.Split(pageSource, "<table class=\"member\">");
string[] memberList = Regex.Split(splitSource[1], "</table>");
// the contents of the table will be in memberList[0]
// method to extract links from the table
ExtractLinks(memberList[0]);

I've been looking at other ways to do this extraction, and I came across the Match object in C#.

I'm attempting to do something like this:

Match match = Regex.Match(pageSource, "<table class=\"members\">(.|\n)*?</table>");

The purpose of the above was to hopefully extract a match value between the two delimiters, but, when I try to run it the match value is:

match.value = </table>

My question, as such, is: is there a way to extract data from my string that is slightly easier, more readable, or shorter than my method using regex? For this simple example, regex is fine, but for more complex examples, I find myself with the coding equivalent of scribbles all over my screen.

I would really like to use match, because it seems like a very neat and tidy class, but I can't seem to get it working for my needs. Can anyone help me with this?

Thank you very much!

One small note: the portion of your regex between the two table tags should read `(.|\n)*?`. If you don't put parentheses around `.|\n`, then the `*?` will only apply to the character before it (`\n` in this case). – Jon Senchyna, Jun 13 '12 at 13:13
possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – jrummell, Jun 13 '12 at 13:13
[Don't parse HTML with regex](http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html) – Shai, Jun 13 '12 at 13:13
Yeah yeah, I typed the html up and wasn't paying attention =p. – gfppaste, Jun 13 '12 at 13:17
Also, the class of your table does not match the class in your regex. – Jon Senchyna, Jun 13 '12 at 13:19
"Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems." ([Jamie Zawinski](http://en.wikiquote.org/wiki/Jamie_Zawinski#Attributed)) – Filburt, Jun 13 '12 at 13:24

score 3 · Accepted Answer · answered Jun 13 '12 at 13:13

Use an HTML parser, like HTML Agility Pack.

var doc = new HtmlDocument();

using (var wc = new WebClient())
using (var stream = wc.OpenRead(url))
{
    doc.Load(stream);
}

// HtmlDocument exposes its root as DocumentNode (DocumentElement belongs to XmlDocument)
var table = doc.DocumentNode.Element("html").Element("body").Element("table");
string tableHtml = table.OuterHtml;

Thomas Levesque

I'm actually trying HTML agility pack, but the lack of documentation is terrifying! and the new downloadable doesn't have a chm, so, to find help, I'm basically looking through the manifest that came with the downloadable... all in all, it does not make for a friendly experience! – gfppaste Jun 13 '12 at 13:16
@gfppaste, there's no real need for documentation, the API is quite self-explanatory and very similar to Linq to XML. I learned to use it by using Intellisense, it's quite intuitive. – Thomas Levesque Jun 13 '12 at 13:18

score 0 · Answer 2 · answered Jun 13 '12 at 13:19

You can use XPath with the HtmlAgilityPack:

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(s);
var elements = doc.DocumentNode.SelectNodes("//table[@class='table']");

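// SelectNodes returns null rather than an empty collection when nothing matches
if (elements != null)
{
    foreach (var ele in elements)
    {
        MessageBox.Show(ele.OuterHtml);
    }
}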

Andrei Bozantan · Answer 3 · 2012-06-13T13:27:21.050

0

You have to add parentheses in the regular expression in order to capture the matches:

Match match = Regex.Match(pageSource, "<table class=\"members\">((?:.|\n)*?)</table>");

Anyways it seems that only Chuck Norris can parse HTML with regex correctly.
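
For what it's worth, here is a minimal sketch of the capture-group idea, assuming the table has no nested tables and that the opening tag is written exactly as `<table class="...">`. The class name `TableExtractor` and the method `ExtractTableBody` are made up for illustration. It uses `RegexOptions.Singleline` so that `.` also matches newlines, which avoids the `(.|\n)` alternation, and it reads the captured text from `match.Groups[1]` rather than `match.Value` (which would include the tags themselves).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class TableExtractor
{
    // Returns the text between <table class="cssClass"> and the first
    // following </table>, or null when no such table is found.
    public static string ExtractTableBody(string pageSource, string cssClass)
    {
        // Regex.Escape guards against regex metacharacters in the class name.
        string pattern = "<table class=\"" + Regex.Escape(cssClass) + "\">(.*?)</table>";

        // Singleline makes '.' match '\n' as well; '*?' keeps the match lazy
        // so it stops at the first closing tag.
        Match match = Regex.Match(pageSource, pattern, RegexOptions.Singleline);

        // Groups[0] is the whole match (tags included); Groups[1] is the capture.
        return match.Success ? match.Groups[1].Value : null;
    }
}
```

Calling `TableExtractor.ExtractTableBody(pageSource, "table")` on the HTML from the question should give back the rows between the table tags, which could then be handed to `ExtractLinks`. This only holds up for simple, predictable markup; extra attributes, different quoting, or nested tables would break it, which is why the other answers suggest a parser.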

edited Jun 13 '12 at 13:27

answered Jun 13 '12 at 13:20
