
New to C# here, but I've used Java for years. I tried googling this and got a couple of answers that were not quite what I need. I'd like to grab the (X)HTML from a website and then use DOM (actually, CSS selectors are preferable, but whatever works) to grab a particular element. How exactly is this done in C#?

Peter

7 Answers


To get the HTML you can use the WebClient class.

To parse the HTML you can use the Html Agility Pack library.
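
Something along these lines ties the two together (just a sketch; the URL and the //h1 XPath query are placeholder choices, not part of this answer):

using System;
using System.Net;
using HtmlAgilityPack;

string html;
using (var client = new WebClient())
{
    // download the raw markup
    html = client.DownloadString("http://www.stackoverflow.com");
}

var doc = new HtmlDocument();
doc.LoadHtml(html);

// Html Agility Pack uses XPath rather than CSS selectors;
// as an example, grab the first <h1> on the page
HtmlNode heading = doc.DocumentNode.SelectSingleNode("//h1");
if (heading != null)
    Console.WriteLine(heading.InnerText);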

Maxim
// requires System, System.IO, System.Net and System.Text

// prepare the web page we will be asking for
HttpWebRequest request = (HttpWebRequest)
    WebRequest.Create("http://www.stackoverflow.com");

// execute the request
HttpWebResponse response = (HttpWebResponse)request.GetResponse();

// we will read data via the response stream
Stream resStream = response.GetResponseStream();

byte[] buf = new byte[8192];
StringBuilder sb = new StringBuilder();
string tempString = null;
int count = 0;

do
{
    // fill the buffer with data
    count = resStream.Read(buf, 0, buf.Length);

    // make sure we read some data
    if (count != 0)
    {
        // translate from bytes to ASCII text
        tempString = Encoding.ASCII.GetString(buf, 0, count);

        // continue building the string
        sb.Append(tempString);
    }
}
while (count > 0); // any more data to read?

Then use XPath expressions or a regex to grab the element you need.
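
A rough sketch of the regex route (assuming the loop above has filled sb; the <title> tag is only an example target):

// requires System.Text.RegularExpressions
Match m = Regex.Match(sb.ToString(), @"<title>(.*?)</title>",
    RegexOptions.IgnoreCase | RegexOptions.Singleline);
if (m.Success)
{
    Console.WriteLine(m.Groups[1].Value);
}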

jaywayco

You could use System.Net.WebClient or System.Net.HttpWebRequest to fetch the page, but parsing for elements is not supported by those classes.

Use HtmlAgilityPack (http://html-agility-pack.net/)

HtmlWeb htmlWeb = new HtmlWeb();
htmlWeb.UseCookies = true;

HtmlDocument htmlDocument = htmlWeb.Load(url);

// after getting the document node
// you can do something like this
foreach (HtmlNode item in htmlDocument.DocumentNode.Descendants("input"))
{
    // item matches your requirement
    // take the item
}
Vijay Sirigiri

I hear you want to use the HtmlAgilityPack for working with HTML files. This will give you LINQ access, which is A Good Thing (tm). You can download the file with System.Net.WebClient.
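
A short sketch of what that LINQ access looks like (the URL and the anchor/href query are illustrative choices, not from this answer):

using System.Linq;
using System.Net;
using HtmlAgilityPack;

var html = new WebClient().DownloadString("http://www.stackoverflow.com");

var doc = new HtmlDocument();
doc.LoadHtml(html);

// LINQ over the parsed nodes: collect the href of every anchor
var links = doc.DocumentNode.Descendants("a")
    .Select(a => a.GetAttributeValue("href", ""))
    .Where(href => !string.IsNullOrEmpty(href))
    .ToList();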

Daren Thomas

You can use the Html Agility Pack to load the HTML and find the element you need.

Giorgi

To get you started, you can fairly easily use HttpWebRequest to get the contents of a URL. From there, you will have to do something to parse out the HTML. That is where it starts to get tricky. You can't use a normal XML parser, because many (most?) website HTML pages aren't 100% valid XML. Web browsers have specially implemented parsers to work around the invalid portions. In Ruby, I would use something like Nokogiri to parse the HTML, so you might want to look for a .NET port of it, or another parser specifically designed to read HTML.


Edit:

Since the topic is likely to come up: WebClient vs. HttpWebRequest/HttpWebResponse

Also, thanks to the others who answered for noting the Html Agility Pack. I didn't know it existed.

CodingWithSpike

Look into using the Html Agility Pack, which is one of the more common libraries for parsing HTML.

http://htmlagilitypack.codeplex.com/

Tija