
I am trying to parse an HTML document by its td tags using C#, so that

<td>Whatever string</td><td class="pass">value</td>

would return

Whatever string : value

I have spent hours on this problem, trying XML parsers and regular expressions, to no avail. Thanks for your help.

I have already tried

    List<string> list = Regex.Split(lineslineWithTdTag[i], "[<td>].[<\td>]").ToList();
    List<string> status = Regex.Split(list[3], "[pass=\"].\"").ToList() ;

and then I tried parsing that list

  • You need to show whatever code you have tried. – Donal Jun 05 '14 at 15:39
  • What have you tried? If you post the code you're using we can help work out the problem. – Andrew Cooper Jun 05 '14 at 15:39
  • Have you tried the HtmlAgilityPack? – Andrew Cooper Jun 05 '14 at 15:40
  • You can't parse HTML with regex, and you can't parse it with an XML parser either (it may not be valid XML unless it's XHTML). You need a raw HTML parser that walks the td tags with a flag: if the flag is 0, save the name and set the flag to 1; if the flag is 1, save the value and set the flag to 0. – Adriano Repetti Jun 05 '14 at 15:40
  • HTML, though similar, is not XML, so using an XML parser would not work. With regard to regex, I feel it's obligatory to link this answer: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 HtmlAgilityPack is almost *always* the go-to solution for this kind of problem. Have you looked into that? – tnw Jun 05 '14 at 15:41
  • An HTML document should contain a root node. Where do you get these elements from? – Sergey Berezovskiy Jun 05 '14 at 15:44
  • I have looked into HtmlAgilityPack, however due to restrictions at work I am unable to download any external libraries. – user3386190 Jun 05 '14 at 15:49
  • You really need to read the Regex documentation. Your syntax is all wrong. – Andrew Cooper Jun 05 '14 at 16:29
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#1732454 – Icemanind Jun 05 '14 at 17:15

2 Answers

0

At the risk of incurring the wrath of the "you can't parse HTML with Regex" purists, here's a regex solution that should do what you want:

    // Group 1 captures the first cell's text, group 2 the second cell's text.
    var match = Regex.Match(lineslineWithTdTag[i], "<td>(.*?)</td><td.*?>(.*?)</td>");
    string result = match.Groups[1].Value + " : " + match.Groups[2].Value;

Of course, if the actual document is not as well formatted as your example, then all bets are off.
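
For a line that holds more than one name/value pair, the same idea can be sketched with Regex.Matches. This is only an illustration under the same assumption of predictable markup: the class name TdPairs, the method Extract, and the pattern itself are hypothetical, not part of the question. It uses only the framework (System.Text.RegularExpressions and System.Net.WebUtility), so no external library is needed.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class TdPairs
{
    // Hypothetical pattern: two adjacent <td> cells, each with or without attributes.
    // It assumes the cells are not nested and each pair is name first, value second.
    private static readonly Regex PairPattern = new Regex(
        @"<td\b[^>]*>(.*?)</td>\s*<td\b[^>]*>(.*?)</td>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static List<string> Extract(string html)
    {
        var results = new List<string>();
        foreach (Match m in PairPattern.Matches(html))
        {
            // Decode entities such as &amp; so the output is plain text.
            string name = WebUtility.HtmlDecode(m.Groups[1].Value).Trim();
            string value = WebUtility.HtmlDecode(m.Groups[2].Value).Trim();
            results.Add(name + " : " + value);
        }
        return results;
    }

    public static void Main()
    {
        string line = "<td>Whatever string</td><td class=\"pass\">value</td>";
        foreach (string pair in Extract(line))
        {
            Console.WriteLine(pair); // prints "Whatever string : value"
        }
    }
}
```

Cells that contain other tags, or rows with an odd number of cells, would need extra handling that this sketch does not attempt.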

Andrew Cooper
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#1732454 – Icemanind Jun 05 '14 at 17:17
  • @icemanind - Yes, I saw that in the comments above, and love the answer. I agree that Regex cannot be used to parse HTML in general. However, for a subset of possible HTML cases, where the HTML is formatted predictably, it can be useful. – Andrew Cooper Jun 05 '14 at 17:24
0

After a lot of work, this ended up being my solution:

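        // Requires: using System; using System.Linq; using System.Xml.Linq;
        // Note: XDocument.Load only succeeds if the page is well-formed XML (e.g. XHTML).
        string path = @"http://localhost/page.html";
        XDocument myX = XDocument.Load(path);
        string field1 = "";
        string field2 = "";
        bool flag = true; // true: the next td is a name; false: the next td is a value
        foreach (var name in myX.Root.DescendantNodes().OfType<XElement>())
        {
            // get the first element of the pair (the name)
            if (name.Name.LocalName == "td" && flag)
            {
                field1 = (string)name;
                flag = false;
            }
            // get the second element of the pair (the value)
            else if (name.Name.LocalName == "td")
            {
                field2 = (string)name;
                flag = true;
                // both halves are known here, so emit "name : value"
                Console.WriteLine(field1 + " : " + field2);
            }
        }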