Parse an HTML line for inner text

Question

I am trying to parse an HTML document by its td tags using C#, so that

<td>Whatever string</td><td class="pass">value</td>

would return

Whatever string : value

I have spent hours on this problem, trying XML parsers and regular expressions, but to no avail. Thanks for your help.

I have already tried

    List<string> list = Regex.Split(lineslineWithTdTag[i], "[<td>].[<\td>]").ToList();
    List<string> status = Regex.Split(list[3], "[pass=\"].\"").ToList() ;

and then I tried parsing that list.

What have you tried? If you post the code you're using we can help work out the problem. — Andrew Cooper, Jun 05 '14 at 15:39
You can't parse HTML with regex, you can't parse HTML with XMLParser (because it may not be valid XML unless it's XHTML). You need a raw HTML parser: flag is 0? - save name - set flag to 1, flag is 1? save value - set flag to 0 — Adriano Repetti, Jun 05 '14 at 15:40
HTML, though similar, is not XML so using a XML parser would not work. In regards to regex, I feel it's obligatory to link this answer: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 HtmlAgilityPack is almost *always* the go-to solution for this kind of problem. Have you looked into that? — tnw, Jun 05 '14 at 15:41
html document should contain root node. Where you get these elements from? — Sergey Berezovskiy, Jun 05 '14 at 15:44
I have looked into HtmlAgilityPack, however due to restrictions at work I am unable to download any external libraries. — user3386190, Jun 05 '14 at 15:49
You really need to read the Regex documentation. Your syntax is all wrong. — Andrew Cooper, Jun 05 '14 at 16:29
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#1732454 — Icemanind, Jun 05 '14 at 17:15

score 0 · Answer 1 · answered Jun 05 '14 at 16:53

At the risk of incurring the wrath of the "you can't parse HTML with Regex" purists, here's a regex solution that should do what you want:

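    // Requires: using System.Text.RegularExpressions;
    var match = Regex.Match(lineslineWithTdTag[i], "<td>(.*?)</td><td.*?>(.*?)</td>");
    // Group 1 is the text of the first cell, group 2 the text of the second.
    string result = match.Groups[1].Value + " : " + match.Groups[2].Value;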

Of course, if the actual document is not as well formatted as your example, then all bets are off.
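
For a line that may hold several cell pairs, or whose first cell also carries attributes, the same idea can be sketched with `Regex.Matches`. This is only a rough illustration under the same assumption of predictably formatted markup; the `TdPairs` class and its `Extract` method are hypothetical names, not part of the question's code.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TdPairs
{
    // Matches two adjacent <td> cells (attributes allowed on either one)
    // and captures the text inside each cell.
    static readonly Regex PairPattern = new Regex(
        @"<td\b[^>]*>(.*?)</td>\s*<td\b[^>]*>(.*?)</td>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Hypothetical helper: returns one "name : value" string per cell pair.
    public static List<string> Extract(string line)
    {
        var results = new List<string>();
        foreach (Match m in PairPattern.Matches(line))
        {
            results.Add(m.Groups[1].Value.Trim() + " : " + m.Groups[2].Value.Trim());
        }
        return results;
    }
}

static class RegexDemo
{
    static void Main()
    {
        // Sample line taken from the question.
        string line = "<td>Whatever string</td><td class=\"pass\">value</td>";
        foreach (string pair in TdPairs.Extract(line))
        {
            Console.WriteLine(pair);
        }
    }
}
```

Nested tags inside a cell, or cells split across lines in unexpected ways, would still defeat this pattern.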

Andrew Cooper

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#1732454 – Icemanind Jun 05 '14 at 17:17
@icemanind - Yes, I saw that in the comments above, and love the answer. I agree that Regex cannot be used to parse HTML in general. However, for a subset of possible HTML cases, where the HTML is formatted predictably, it can be useful. – Andrew Cooper Jun 05 '14 at 17:24

score 0 · Answer 2 · answered Jun 06 '14 at 13:58

After a lot of work, this ended up being my solution

    // Requires: using System.Linq; using System.Xml.Linq;
    // XDocument.Load only succeeds if the page is well-formed XML (e.g. XHTML).
    string path = @"http://localhost/page.html";
    XDocument myX = XDocument.Load(path);
    string field1 = "";
    string field2 = "";
    bool flag = true; // true: the next td is the first of a pair
    foreach (var element in myX.Root.DescendantNodes().OfType<XElement>())
    {
        // get the first element
        if (element.Name.LocalName == "td" && flag)
        {
            field1 = (string)element + "\n";
            flag = false;
        }
        // get the second element
        else if (element.Name.LocalName == "td")
        {
            field2 = (string)element + "\n";
            flag = true;
        }
    }
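
As a variation on the loop above, the cells could be paired row by row so that each pair yields the `name : value` string asked for in the question. This is a minimal sketch, assuming the markup is well-formed XML with a single root element and that the cells sit inside `tr` rows; the `XhtmlCells` class, the `PairCells` method, and the sample markup are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class XhtmlCells
{
    // Hypothetical helper: pairs up the td cells of every tr row.
    public static List<string> PairCells(string markup)
    {
        // Throws System.Xml.XmlException if the markup is not well-formed XML.
        XDocument doc = XDocument.Parse(markup);
        var results = new List<string>();

        // Compare LocalName so an XHTML namespace on the elements does not matter.
        foreach (XElement row in doc.Descendants().Where(e => e.Name.LocalName == "tr"))
        {
            List<string> cells = row.Elements()
                .Where(e => e.Name.LocalName == "td")
                .Select(td => td.Value.Trim())
                .ToList();

            // Take the cells two at a time; a trailing unpaired cell is ignored.
            for (int i = 0; i + 1 < cells.Count; i += 2)
            {
                results.Add(cells[i] + " : " + cells[i + 1]);
            }
        }
        return results;
    }
}

static class XmlDemo
{
    static void Main()
    {
        // Assumed sample input, wrapped in a root element so it parses as XML.
        string markup =
            "<table><tr><td>Whatever string</td><td class=\"pass\">value</td></tr></table>";
        foreach (string pair in XhtmlCells.PairCells(markup))
        {
            Console.WriteLine(pair);
        }
    }
}
```

Like the original loop, this relies only on `System.Xml.Linq` from the framework, so no external library is needed, but it fails on HTML that is not valid XML.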
