How can I parse a complete HTML website in C#?

A small example:

<html>
 <head></head>
 <body>
  <div class="wrapper">
   <div class="row">
    <div>Value1</div>
    <div>Value2</div>
   </div>
   <div class="row">
    <div>Value1</div>
    <div>Value2</div>
   </div>
   <div class="row">
    <div>Value1</div>
    <div>Value2</div>
   </div>
   <div class="row">
    <div>Value1</div>
    <div>Value2</div>
   </div>
  </div>
 </body>
</html>

I cannot use the page's class names to identify the containers, because they vary.

Now I want to save the values.

My current code:

WebBrowser wb = (WebBrowser)sender;

var doc = wb.Document as HTMLDocument;

IHTMLElementCollection nodes = doc.getElementsByTagName("div");

foreach(IHTMLElement elem in nodes)
{
    var div = (HTMLDivElement)elem;

    if(div.className != null && div.className.Contains("t_row"))
    {
        //BREAKPOINT
        var inner = div.document as HTMLDocument;
        IHTMLElementCollection innerNode = inner.getElementsByTagName("div");

        log(div.innerText);
    }
}

Everything works fine up to the breakpoint, but I don't know how to continue from there.

Philipp Nies
Depending on how far your HTML page is from well-formed XML, you should consider using [HTML Agility Pack](http://stackoverflow.com/q/846994/205233) to parse it. – Filburt Mar 16 '16 at 16:24

1 Answer


You can extract the data using WebsiteParser. Its usage is similar to other parsing libraries. For your example HTML it would look something like this:

IEnumerable<WrapperItem> items = WebContentParser.ParseList<WrapperItem>(html);

// ...

[ListSelector(".wrapper", ChildSelector = ".row")]
class WrapperItem
{
    [Selector("div:nth-child(1)")]
    public string Value1 { get; set; }

    [Selector("div:nth-child(2)")]
    public string Value2 { get; set; }
}

To download the website's HTML you can use WebClient (from System.Net):

WebClient client = new WebClient();
string html = client.DownloadString("https://example.com");
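
Since the question says the class names cannot be relied on, a purely structural approach may also be worth a look. The following is only a minimal sketch, not part of WebsiteParser: it assumes the page is well-formed XML (as the sample is, shortened here to two rows) and uses LINQ to XML from the base class library. The names RowExtractor and ExtractRows are made up for illustration. A "row" is taken to be any div whose children are all leaf divs, so no class attribute is consulted. For real-world pages that are not well-formed, XDocument.Parse will throw, and a tolerant parser such as HTML Agility Pack would be needed instead.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper: finds rows by structure, not by class name.
static class RowExtractor
{
    // Returns one string array per row. A row is a <div> that has child
    // elements, all of which are <div>s without child elements of their own.
    public static List<string[]> ExtractRows(string html)
    {
        // Assumption: the markup is well-formed XML; otherwise this throws.
        XDocument doc = XDocument.Parse(html);

        return doc.Descendants("div")
            .Where(d => d.Elements().Any()
                     && d.Elements().All(c => c.Name.LocalName == "div" && !c.HasElements))
            .Select(d => d.Elements().Select(c => c.Value.Trim()).ToArray())
            .ToList();
    }
}

class Program
{
    static void Main()
    {
        // Sample markup from the question, shortened to two rows.
        string html = @"<html>
 <head></head>
 <body>
  <div class=""wrapper"">
   <div class=""row"">
    <div>Value1</div>
    <div>Value2</div>
   </div>
   <div class=""row"">
    <div>Value1</div>
    <div>Value2</div>
   </div>
  </div>
 </body>
</html>";

        // Print each row's values separated by " | ".
        foreach (string[] row in RowExtractor.ExtractRows(html))
        {
            Console.WriteLine(string.Join(" | ", row));
        }
    }
}
```

The wrapper div is skipped because its children contain further elements, and the value divs are skipped because they have no child elements, so only the row level matches regardless of what the classes are called.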
jasniec