
I have successfully scraped data from a website's page. However, it contains both HTML tags as well as plain text. How can I filter the unwanted data (tags, scripts, some text which is not required, etc.) out of this scraped data? At least suggest some approach for doing it.

2 Answers


You can use HTML Agility Pack to parse the HTML and remove any unwanted tags.

How to use HTML Agility Pack

Asif Mushtaq

You can start by taking a look at the HTML Agility Pack. This should allow you to remove any HTML.

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what System.Xml proposes, but for HTML documents (or streams).
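To make this concrete, here is a minimal sketch of the approach both answers describe: load the scraped HTML with HTML Agility Pack, delete the `script` and `style` nodes, and then read `InnerText` to get the plain text. The sample HTML string and class name are illustrative assumptions, not taken from the question, and you would need the HtmlAgilityPack NuGet package installed.

```csharp
using System;
using HtmlAgilityPack; // from the HtmlAgilityPack NuGet package

class Scraper
{
    static void Main()
    {
        // Illustrative stand-in for the HTML you scraped.
        string scrapedHtml =
            "<html><body><script>var x = 1;</script>" +
            "<p>Useful text</p><div>More text</div></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(scrapedHtml);

        // Remove <script> and <style> nodes so their contents
        // don't leak into the extracted text.
        var junk = doc.DocumentNode.SelectNodes("//script|//style");
        if (junk != null)
        {
            foreach (var node in junk)
                node.Remove();
        }

        // InnerText strips all remaining tags, leaving only plain text.
        Console.WriteLine(doc.DocumentNode.InnerText);
    }
}
```

The same pattern extends to other unwanted content: adjust the XPath (for example `//script|//style|//nav|//footer`) to drop whichever elements you consider noise before reading `InnerText`.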

npinti