New to C# here, but I've used Java for years. I tried googling this and got a couple of answers that were not quite what I need. I'd like to grab the (X)HTML from a website and then use the DOM (actually, CSS selectors are preferable, but whatever works) to grab a particular element. How exactly is this done in C#?
7 Answers
To get the HTML you can use the WebClient object.
To parse the HTML you can use the Html Agility Pack library.
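A minimal sketch of that combination might look like this (the URL and the XPath expression are just placeholders):

using System;
using System.Net;
using HtmlAgilityPack;

// download the raw HTML with WebClient
string html;
using (var client = new WebClient())
{
    html = client.DownloadString("http://www.stackoverflow.com");
}

// parse it with the Html Agility Pack
var doc = new HtmlDocument();
doc.LoadHtml(html);

// grab a particular element, e.g. the page title, via XPath
HtmlNode title = doc.DocumentNode.SelectSingleNode("//title");
if (title != null)
    Console.WriteLine(title.InnerText);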

using System.IO;
using System.Net;
using System.Text;

// prepare the web page we will be asking for
HttpWebRequest request = (HttpWebRequest)
    WebRequest.Create("http://www.stackoverflow.com");

// execute the request
HttpWebResponse response = (HttpWebResponse)request.GetResponse();

// we will read data via the response stream
Stream resStream = response.GetResponseStream();

byte[] buf = new byte[8192];
StringBuilder sb = new StringBuilder();
string tempString = null;
int count = 0;

do
{
    // fill the buffer with data
    count = resStream.Read(buf, 0, buf.Length);

    // make sure we read some data
    if (count != 0)
    {
        // translate from bytes to ASCII text
        // (use Encoding.UTF8 for pages with non-ASCII characters)
        tempString = Encoding.ASCII.GetString(buf, 0, count);

        // continue building the string
        sb.Append(tempString);
    }
} while (count > 0); // any more data to read?
Then use XQuery expressions or a regex to grab the element you need.
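For instance, a regex can pull a simple element such as the title out of the string assembled above (sb is the StringBuilder from the snippet; regular expressions are brittle against real-world HTML, so treat this as a quick-and-dirty sketch):

using System;
using System.Text.RegularExpressions;

string html = sb.ToString();
Match m = Regex.Match(html, @"<title[^>]*>\s*(.*?)\s*</title>",
                      RegexOptions.IgnoreCase | RegexOptions.Singleline);
if (m.Success)
{
    // group 1 holds the text between the title tags
    Console.WriteLine(m.Groups[1].Value);
}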

You could use System.Net.WebClient or System.Net.HttpWebRequest to fetch the page, but neither class supports parsing out elements. For that, use the HtmlAgilityPack (http://html-agility-pack.net/):
HtmlWeb htmlWeb = new HtmlWeb();
htmlWeb.UseCookies = true;
HtmlDocument htmlDocument = htmlWeb.Load(url);

// after getting the document node
// you can do something like this
foreach (HtmlNode item in htmlDocument.DocumentNode.Descendants("input"))
{
    // check whether item matches your requirements,
    // and take the item if it does
}
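Since the question asked about CSS selectors: the Html Agility Pack itself only offers XPath and LINQ, but the Fizzler library layers CSS selectors on top of it. A rough sketch, assuming the Fizzler.Systems.HtmlAgilityPack package is installed (the selector string is a placeholder):

using Fizzler.Systems.HtmlAgilityPack;

// QuerySelectorAll is an extension method that Fizzler adds to HtmlNode
foreach (HtmlNode node in htmlDocument.DocumentNode.QuerySelectorAll("div.answer a"))
{
    Console.WriteLine(node.InnerText);
}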

I hear you want to use the HtmlAgilityPack for working with HTML files. This will give you LINQ access, which is A Good Thing (tm). You can download the file with System.Net.WebClient.
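A rough illustration of that LINQ access (the URL and element names are just placeholders):

using System;
using System.Linq;
using System.Net;
using HtmlAgilityPack;

string html;
using (var client = new WebClient())
{
    html = client.DownloadString("http://example.com");
}

var doc = new HtmlDocument();
doc.LoadHtml(html);

// LINQ over the parsed nodes: collect every absolute link, for example
var links = doc.DocumentNode.Descendants("a")
    .Select(a => a.GetAttributeValue("href", null))
    .Where(href => href != null && href.StartsWith("http"));

foreach (string href in links)
    Console.WriteLine(href);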

To get you started, you can fairly easily use HttpWebRequest to get the contents of a URL. From there, you will have to do something to parse out the HTML. That is where it starts to get tricky. You can't use a normal XML parser, because many (most?) web site HTML pages aren't 100% valid XML. Web browsers have specially implemented parsers to work around the invalid portions. In Ruby, I would use something like Nokogiri to parse the HTML, so you might want to look for a .NET port of it, or another parser specifically designed to read HTML.
Edit:
Since the topic is likely to come up: WebClient vs. HttpWebRequest/HttpWebResponse
Also, thanks to the others who answered for mentioning the HtmlAgilityPack. I didn't know it existed.

Look into using the Html Agility Pack, which is one of the more common libraries for parsing HTML.
