Parsing the useful content of a web page in C#

Question

Possible Duplicate:
Parsing web pages

I am trying to parse the content of a web page in C#. This is the code that I use:

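// Requires: using System.IO; using System.Net;
WebRequest request = WebRequest.Create("URL"); // "URL" stands for the page address
string html = String.Empty;
// Dispose the response and the reader once the body has been read
using (WebResponse response = request.GetResponse())
using (StreamReader sr = new StreamReader(response.GetResponseStream()))
{
    html = sr.ReadToEnd();
}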

The problem is that I get everything that the HTML contains.

Do you have any suggestions on how to extract the useful data in a 'clean' way, or do I have to build my own parser? For example: a post containing a title and the text related to it, in a blog-like format.

What is 'useful information'? You will have to parse the response yourself. – Arran, Jan 16 '13 at 12:18
Useful is the actual information that the page contains, with no hidden tags or other data. – NiVeR, Jan 16 '13 at 12:20
[Html Agility Pack](http://htmlagilitypack.codeplex.com/) is pretty good for this sort of stuff. Go and have a read, then play with it, and then come back if you have any specific problems. – musefan, Jan 16 '13 at 12:20
You will need to use the Regex class to ignore the parts you don't need. Actually, you will need to build it yourself! – Mohsen Rabieai, Jan 16 '13 at 12:20
Regex is not the fastest solution in every case, and +1 @musefan. There are already parsers available to parse HTML. – Beachwalker, Jan 16 '13 at 12:24
Do not reinvent the wheel yourself! Many frameworks are already available. – Cybermaxs, Jan 16 '13 at 12:31

score 5 · Accepted Answer · answered Jan 16 '13 at 12:29

If you are indeed trying to parse blog posts from a web page, do not do it that way; don't even think of using the HTML Agility Pack.

Instead, you should use SyndicationFeed and the related classes that are already built into the .NET Framework (since v3.5). They are tailor-made for consuming and ripping apart RSS feeds.
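A minimal sketch of that approach, assuming the blog exposes an RSS 2.0 feed. The sample feed, the `FeedSketch` class and the `ReadPosts` helper are made up for illustration; only `SyndicationFeed`, `SyndicationItem` and `XmlReader` come from the framework. On .NET Framework the syndication types need a reference to System.ServiceModel.dll (System.ServiceModel.Web.dll on 3.5); on .NET Core and later they come from the System.ServiceModel.Syndication NuGet package.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.ServiceModel.Syndication;
using System.Xml;

public static class FeedSketch
{
    // Hypothetical sample feed so that the sketch runs without network access.
    public const string SampleRss =
        @"<rss version=""2.0"">
  <channel>
    <title>Example blog</title>
    <link>http://example.com/</link>
    <description>Sample feed</description>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
      <description>Text of the first post.</description>
    </item>
  </channel>
</rss>";

    // Returns one "title: summary" string per feed item.
    public static List<string> ReadPosts(string feedXml)
    {
        var posts = new List<string>();
        // For a live feed, XmlReader.Create(feedUrl) accepts the feed address instead.
        using (XmlReader reader = XmlReader.Create(new StringReader(feedXml)))
        {
            SyndicationFeed feed = SyndicationFeed.Load(reader);
            foreach (SyndicationItem item in feed.Items)
            {
                // Title and Summary may be missing in real feeds, so guard against null.
                string title = item.Title != null ? item.Title.Text : "(no title)";
                string summary = item.Summary != null ? item.Summary.Text : "(no summary)";
                posts.Add(title + ": " + summary);
            }
        }
        return posts;
    }

    public static void Main()
    {
        foreach (string post in ReadPosts(SampleRss))
        {
            Console.WriteLine(post);
        }
    }
}
```

This only helps when the site actually publishes a feed; otherwise the HTML itself still has to be parsed.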

slugster

Yes, but only if the website supports publishing/distributing them in this format (as every modern blogging site does) – Beachwalker Jan 16 '13 at 14:46

score 4 · Answer 2 · Cybermaxs · 2013-01-16T12:27:37.117

Simply use the Html Agility Pack. It's so powerful!

You can find many tutorials on the internet, such as http://runtingsproper.blogspot.fr/2009/09/htmlagilitypack-article-series.html
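As a rough sketch of what that looks like in practice, assuming the HtmlAgilityPack NuGet package is referenced: the markup, the `post` class name, the XPath expressions and the `ExtractText` helper are hypothetical and would need to be adapted to the real page.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // external NuGet package, not part of the .NET Framework

public static class AgilityPackSketch
{
    // Returns the decoded text of every node matched by the XPath expression.
    public static List<string> ExtractText(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var results = new List<string>();
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            return results; // SelectNodes returns null when nothing matches
        }

        foreach (HtmlNode node in nodes)
        {
            // InnerText drops the tags; DeEntitize turns "&amp;" back into "&".
            results.Add(HtmlEntity.DeEntitize(node.InnerText).Trim());
        }
        return results;
    }

    public static void Main()
    {
        // Hypothetical blog-like markup standing in for the downloaded page.
        string html =
            "<html><body><div class='post'><h2>First post</h2>" +
            "<p>Hello &amp; welcome.</p></div>" +
            "<script>var hidden = 1;</script></body></html>";

        foreach (string title in ExtractText(html, "//div[@class='post']/h2"))
        {
            Console.WriteLine("Title: " + title);
        }
        foreach (string text in ExtractText(html, "//div[@class='post']/p"))
        {
            Console.WriteLine("Text: " + text);
        }
    }
}
```

Selecting only the nodes that hold the post is what keeps scripts and other hidden markup out of the result.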

edited Jan 16 '13 at 12:27

answered Jan 16 '13 at 12:21

Cybermaxs

score 1 · Answer 3 · answered Jan 16 '13 at 12:22

Use a Regex. To parse the data between two tags (which I assume you want to do), you could, for example, do something like this:

// html is the page source as a string; requires using System.Text.RegularExpressions;
string match = Regex.Match(html, "<a>(?<inbetween>.+?)</a>").Groups["inbetween"].Value;

Using a Regex, unlike the Agility Pack, does not require an external dependency, which is great for portable, stand-alone applications.
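A slightly fuller sketch of the same idea, collecting every match rather than only the first and allowing attributes on the opening tag. The `TextBetweenTags` helper and the sample markup are made up for illustration. It still assumes simple, well-formed markup: nested tags of the same name or a `>` inside an attribute value will give wrong results.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexSketch
{
    // Returns the raw content between each <tagName ...> and </tagName> pair.
    public static List<string> TextBetweenTags(string html, string tagName)
    {
        string tag = Regex.Escape(tagName);
        // \b stops "a" from matching "abbr"; [^>]* skips any attributes.
        string pattern = "<" + tag + @"\b[^>]*>(?<inbetween>.*?)</" + tag + @"\s*>";
        // Singleline lets '.' span line breaks inside the element.
        RegexOptions options = RegexOptions.Singleline | RegexOptions.IgnoreCase;

        var results = new List<string>();
        foreach (Match m in Regex.Matches(html, pattern, options))
        {
            results.Add(m.Groups["inbetween"].Value);
        }
        return results;
    }

    public static void Main()
    {
        // Hypothetical input standing in for the downloaded page.
        string html = "<p><a href=\"/one\">First</a> and <a>Second</a></p>";
        foreach (string text in TextBetweenTags(html, "a"))
        {
            Console.WriteLine(text);
        }
    }
}
```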

Caster Troy

You probably need to read this response http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Ilya Ivanov Jan 16 '13 at 12:23
@Ilya Ivanov I'm aware of the opinion, but I don't see why regular expressions cannot be used in this context. I know this code is not robust, as it only accounts for one match, which isn't realistic, but despite that I still don't see why a regular expression would fail in a proper environment. – Caster Troy Jan 16 '13 at 12:28
`proper environment` is kind of the key term here. The web is **not** a proper environment. Anyway, I don't want to be insulting; there are just a huge number of good and simple tools for HTML which are, well, a little bit better than regular expressions. – Ilya Ivanov Jan 16 '13 at 12:31
