
I have successfully scraped data from a website's page. However, it contains both HTML tags as well as plain text. How can I filter the unwanted data (tags, scripts, some text which is not required, etc.) out of this scraped data? At least suggest some approach for doing it.

2 Answers


You can use HTML Agility Pack to parse the HTML and remove any unwanted tags.

How to use HTML Agility Pack

Asif Mushtaq

You can start by taking a look at the HTML Agility Pack. This should allow you to remove any HTML.

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what System.Xml proposes, but for HTML documents (or streams).
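To make this concrete, here is a minimal sketch of the approach both answers describe: load the scraped HTML with HTML Agility Pack, delete the `script` and `style` nodes, and then read `InnerText` to get the plain text. The sample HTML string and class name are illustrative assumptions, not taken from the question, and you would need the HtmlAgilityPack NuGet package installed.

```csharp
using System;
using HtmlAgilityPack; // from the HtmlAgilityPack NuGet package

class Scraper
{
    static void Main()
    {
        // Illustrative stand-in for the HTML you scraped.
        string scrapedHtml =
            "<html><body><script>var x = 1;</script>" +
            "<p>Useful text</p><div>More text</div></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(scrapedHtml);

        // Remove <script> and <style> nodes so their contents
        // don't leak into the extracted text.
        var junk = doc.DocumentNode.SelectNodes("//script|//style");
        if (junk != null)
        {
            foreach (var node in junk)
                node.Remove();
        }

        // InnerText strips all remaining tags, leaving only plain text.
        Console.WriteLine(doc.DocumentNode.InnerText);
    }
}
```

The same pattern extends to other unwanted content: adjust the XPath (for example `//script|//style|//nav|//footer`) to drop whichever elements you consider noise before reading `InnerText`.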

npinti