I want to extract one part of an HTML page: the ul with class="list-2".

<!DOCTYPE html>
<html>
    <title>Title</title>
    <body>
        <div>
            <ul class="list-1">
                <li class="item">1</li>
                <li class="item">2</li>
                <li class="item">3</li>
            </ul>
            <ul class="list-2">
                <li class="item">11</li>
                <li class="item">22</li>
                <li class="item">33</li>
            </ul>
            <ul class="list-1">
                <li class="item">111</li>
                <li class="item">222</li>
                <li class="item">333</li>
            </ul>
        </div>
    </body>
</html>

Here I download the full HTML of the page:

string url = Request.QueryString["url"];
WebClient web = new WebClient();
web.Encoding = System.Text.Encoding.GetEncoding("utf-8");
string html = web.DownloadString(url);

Here I can remove everything that comes before my ul:

html = html.Remove(0, html.IndexOf("<ul class=\"list-2\">"));

How can I get only the HTML of this ul, without what follows it?

Thanks in advance!

Mohit S
Alex
  • 12
    Consider using Html Agility Pack – Andrei Feb 27 '14 at 17:30
  • 5
    Yes, seriously, use HtmlAgilityPack. It will take 30 minutes to learn the package, but you'll have it in your toolbox for the future. – trailmax Feb 27 '14 at 17:36
  • You should use one of the many (X)HTML parsers out there and select the elements of your interest through XPath. For the love of what's holy [do not use regular expressions](http://stackoverflow.com/a/1732454/91696). – Albireo Feb 27 '14 at 17:33

1 Answer

As of late 2015, there are several more HTML parsers (and headless browsers) that can do this. AngleSharp, a parser, is one of them.

Note that "WebClient" only downloads the markup; no JavaScript is executed, so script-generated content will be missing.

This sample extracts the tag from a string (in this case the "string html"):

// --------- your code
string url = Request.QueryString["url"];
WebClient web = new WebClient();
web.Encoding = System.Text.Encoding.GetEncoding("utf-8");
string html = web.DownloadString(url);

// --------- parser code
var parser = new HtmlParser();
var document = parser.Parse(html);

// Get the tag with a CSS selector
var ultag = document.QuerySelector("ul.list-2");

// Get the tag's html string
var ultag_html = ultag.ToHtml();
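
One thing the sample above does not cover is what happens when nothing matches the selector: QuerySelector returns null in that case, and calling a method on the result throws. Here is a minimal sketch of a null-safe helper. It assumes a more recent AngleSharp release (0.10 or later), where the parse method is named ParseDocument and lives in the AngleSharp.Html.Parser namespace; the class name UlExtractor and the method name ExtractOuterHtml are made up for illustration, and it reads the element's OuterHtml property instead of calling ToHtml().

```csharp
using AngleSharp.Dom;
using AngleSharp.Html.Parser;

// Hypothetical helper, not part of AngleSharp.
public static class UlExtractor
{
    // Returns the outer HTML of the first element matching the CSS selector,
    // or null when the document contains no such element.
    public static string ExtractOuterHtml(string html, string cssSelector)
    {
        // Assumption: AngleSharp 0.10+, where Parse was renamed to ParseDocument.
        var parser = new HtmlParser();
        var document = parser.ParseDocument(html);

        // QuerySelector returns only the first match, or null if there is none.
        IElement element = document.QuerySelector(cssSelector);

        // OuterHtml includes the element's own opening and closing tags.
        return element?.OuterHtml;
    }
}
```

Calling UlExtractor.ExtractOuterHtml(html, "ul.list-2") with the downloaded string should give back only that list, and a null result tells you the page did not contain it.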

This sample loads the web page and extracts the tag:

// Setup the configuration to support document loading
var config = Configuration.Default.WithDefaultLoader();

// Load a web page
var address = "an url";

// Asynchronously get the document in a new context using the configuration
var document = await BrowsingContext.New(config).OpenAsync(address);

// This CSS selector gets the desired content
var cssSelector = "ul.list-2";

// Perform the query to get the first tag that matches the selector
var ultag = document.QuerySelector(cssSelector);

// Get the tag's html string
var ultag_html = ultag.ToHtml();
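
QuerySelector only ever returns the first match. If the selector can match several elements (the page in the question has two "ul.list-1" lists, for example), QuerySelectorAll returns all of them. A rough sketch, again assuming AngleSharp 0.10 or later (ParseDocument, AngleSharp.Html.Parser) and working on an already downloaded string; UlListExtractor and ExtractAllOuterHtml are illustrative names, not library API.

```csharp
using System.Collections.Generic;
using System.Linq;
using AngleSharp.Html.Parser;

// Hypothetical helper, not part of AngleSharp.
public static class UlListExtractor
{
    // Returns the outer HTML of every element matching the CSS selector,
    // in document order. The list is empty when nothing matches.
    public static List<string> ExtractAllOuterHtml(string html, string cssSelector)
    {
        // Assumption: AngleSharp 0.10+, where Parse was renamed to ParseDocument.
        var document = new HtmlParser().ParseDocument(html);

        return document.QuerySelectorAll(cssSelector)
                       .Select(element => element.OuterHtml)
                       .ToList();
    }
}
```

With the question's markup, ExtractAllOuterHtml(html, "ul.list-1") should contain two entries and ExtractAllOuterHtml(html, "ul.list-2") one.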

Asons