3

Possible Duplicate:
Parsing web pages

I am trying to parse the content of a web page in C#. This is the code that I use:

WebRequest request = WebRequest.Create("URL");
string html = String.Empty;
// Dispose the response, the stream and the reader once the page has been read.
using (WebResponse response = request.GetResponse())
using (Stream data = response.GetResponseStream())
using (StreamReader sr = new StreamReader(data))
{
    html = sr.ReadToEnd();
}

The problem is that this gives me everything the HTML contains, markup included.

Do you have any suggestions on how to extract just the useful data in a 'clean' way, or do I have to build my own parser? For example: a post containing a title and the text related to it, in a blog-like format.

NiVeR

3 Answers

5

If you are indeed trying to parse blog posts from a web page, do not do it that way. Don't even think of using the HTML Agility Pack.

Instead, use SyndicationFeed and the related classes that are built into the .NET Framework (since v3.5). They are tailor-made for consuming and picking apart RSS feeds.
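
As a rough sketch of what that could look like: the snippet below loads a feed with SyndicationFeed.Load and collects the title and summary of each item. The FeedExample class, the ReadPosts helper and the inline sample feed are made up for illustration, and it assumes a reference to System.ServiceModel.dll (on newer runtimes, the System.ServiceModel.Syndication package).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.ServiceModel.Syndication;
using System.Xml;

public static class FeedExample
{
    // Hypothetical helper: returns (title, summary) pairs for every item in the feed.
    public static List<KeyValuePair<string, string>> ReadPosts(XmlReader reader)
    {
        SyndicationFeed feed = SyndicationFeed.Load(reader);
        var posts = new List<KeyValuePair<string, string>>();
        foreach (SyndicationItem item in feed.Items)
        {
            // Title and Summary may be missing on some items, so guard against null.
            string title = item.Title != null ? item.Title.Text : String.Empty;
            string summary = item.Summary != null ? item.Summary.Text : String.Empty;
            posts.Add(new KeyValuePair<string, string>(title, summary));
        }
        return posts;
    }

    public static void Main()
    {
        // Inline sample feed (an assumption) so the sketch runs without network access.
        const string rss =
            "<rss version=\"2.0\"><channel>" +
            "<title>Sample blog</title><link>http://example.com/</link>" +
            "<description>Demo feed</description>" +
            "<item><title>First post</title><description>Hello world</description></item>" +
            "</channel></rss>";

        using (XmlReader reader = XmlReader.Create(new StringReader(rss)))
        {
            foreach (KeyValuePair<string, string> post in ReadPosts(reader))
            {
                Console.WriteLine(post.Key + ": " + post.Value);
            }
        }
    }
}
```

For a real blog you would pass the feed address instead, e.g. XmlReader.Create("URL of the feed"), provided the site actually publishes one.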

slugster
  • Yes, but only if the website supports publishing/distributing them in this format (as every modern blogging site does) – Beachwalker Jan 16 '13 at 14:46
4

Simply use the Html Agility Pack. It's very powerful!

You can find many tutorials on the internet, such as http://runtingsproper.blogspot.fr/2009/09/htmlagilitypack-article-series.html
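
To give an idea of the shape of the code, here is a minimal sketch for the blog-like case from the question. It assumes the HtmlAgilityPack NuGet package is installed, and the markup it looks for (a div with class "post" containing an h2 title and a p body) is purely an assumption; the XPath expressions have to be adapted to the real page. AgilityPackExample and ExtractPosts are illustrative names.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // external NuGet package, not part of the .NET Framework

public static class AgilityPackExample
{
    // Hypothetical helper: returns (title, text) pairs for every assumed post block.
    public static List<KeyValuePair<string, string>> ExtractPosts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var posts = new List<KeyValuePair<string, string>>();
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//div[@class='post']");
        if (nodes == null)
        {
            return posts; // SelectNodes returns null when nothing matches
        }

        foreach (HtmlNode post in nodes)
        {
            HtmlNode title = post.SelectSingleNode(".//h2");
            HtmlNode body = post.SelectSingleNode(".//p");
            posts.Add(new KeyValuePair<string, string>(
                title != null ? HtmlEntity.DeEntitize(title.InnerText).Trim() : String.Empty,
                body != null ? HtmlEntity.DeEntitize(body.InnerText).Trim() : String.Empty));
        }
        return posts;
    }

    public static void Main()
    {
        // Sample markup (an assumption); in practice pass the html string you downloaded.
        string html =
            "<html><body><div class='post'><h2>My title</h2><p>Some text &amp; more</p></div></body></html>";
        foreach (KeyValuePair<string, string> post in ExtractPosts(html))
        {
            Console.WriteLine(post.Key + ": " + post.Value);
        }
    }
}
```

The html string downloaded with WebRequest in the question can be passed straight to ExtractPosts.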

Cybermaxs
1

Use a Regex. To parse the data between two tags (which I assume is what you want to do), you could, for example, do something like this:

string match = Regex.Match(html, "<a>(?<inbetween>.+?)</a>").Groups["inbetween"].Value;

Using a Regex, unlike the Agility Pack, does not require an external dependency, which is great for portable, stand-alone applications.
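
A slightly fuller sketch of the same idea, collecting every match instead of only the first and tolerating attributes on the opening tag. RegexExample and InnerTexts are illustrative names, and the pattern is still only suitable for simple, non-nested markup that you control.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexExample
{
    // Hypothetical helper: returns the text between each <tagName ...> and </tagName> pair.
    public static List<string> InnerTexts(string html, string tagName)
    {
        string tag = Regex.Escape(tagName);
        // \b keeps "a" from matching "abbr"; [^>]* skips attributes; .*? is non-greedy.
        string pattern = "<" + tag + @"\b[^>]*>(?<inbetween>.*?)</" + tag + @"\s*>";
        RegexOptions options = RegexOptions.Singleline | RegexOptions.IgnoreCase;

        var results = new List<string>();
        foreach (Match m in Regex.Matches(html, pattern, options))
        {
            results.Add(m.Groups["inbetween"].Value);
        }
        return results;
    }

    public static void Main()
    {
        string html = "<a href=\"/one\">First</a> and <A>Second\nline</A>";
        foreach (string text in InnerTexts(html, "a"))
        {
            Console.WriteLine(text);
        }
    }
}
```

RegexOptions.Singleline lets the dot match newlines, so content that spans several lines is still captured. Nested tags of the same name will still be cut off at the first closing tag.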

Caster Troy
  • 1
    You probably need to read this response http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Ilya Ivanov Jan 16 '13 at 12:23
  • 1
    @Ilya Ivanov I'm aware of that opinion, but I don't see why regular expressions cannot be used in this context. I know this code is not robust, as it only accounts for one match, which isn't realistic. Despite that, I still don't see why a regular expression would fail in a proper environment. – Caster Troy Jan 16 '13 at 12:28
  • 1
    `proper environment` is kind of the key term here. The web is **not** a proper environment. Anyway, I don't want to be insulting; there are just a huge number of good and simple tools for HTML, which are, well, a little bit better than regular expressions. – Ilya Ivanov Jan 16 '13 at 12:31