Removing html contents from a web request using C#

Question

I have the following code in C# which gets the contents of a web page and stores them in a string variable.

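// Requires: using System.IO; using System.Net;
string html = String.Empty;
WebRequest request = WebRequest.Create("http://www.arsenal.com");
// Dispose the response, stream, and reader once the page has been read
using (WebResponse response = request.GetResponse())
using (Stream data = response.GetResponseStream())
using (StreamReader sr = new StreamReader(data))
{
    html = sr.ReadToEnd();
}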

The code works properly, but I need to store the content of the page without the HTML tags and JavaScript. Is there any way to do so (a built-in method or something ready-made for this)?
I have found some ways of removing HTML tags, but JavaScript and CSS styles still get through. I should mention that my approach to removing the HTML tags, which uses regular expressions, is not working well either.

Find a library to do it. You will be entering a world of pain if you try writing something yourself. Probably a good time to post this: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — GrandMasterFlush, Nov 10 '16 at 17:24
@GrandMasterFlush I was looking for a library too, but did not find anything — Green Falcon, Nov 10 '16 at 17:26
http://stackoverflow.com/questions/56107/what-is-the-best-way-to-parse-html-in-c might be worth a look then. I've used HTMLAgility pack before. — GrandMasterFlush, Nov 10 '16 at 17:27

score 2 · Accepted Answer · edited May 23 '17 at 10:32

As this question suggests, parsing HTML is a tricky process, and the best approach is to use a library.

I've used the HTML Agility Pack before with some success, though this question lists some other options.
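
For illustration, here is a minimal sketch of how the HTML Agility Pack could be used to drop script and style blocks and keep only the visible text. It assumes the HtmlAgilityPack NuGet package is installed; the class name HtmlTextExtractor and the method name ExtractText are made up for this example, and real pages may need extra handling (comments, noscript blocks, and so on).

```csharp
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // third-party NuGet package, not part of .NET itself

// Hypothetical helper, named for this example only.
public static class HtmlTextExtractor
{
    public static string ExtractText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches, so guard before iterating.
        var unwanted = doc.DocumentNode.SelectNodes("//script|//style");
        if (unwanted != null)
        {
            // Copy to a list so that removing nodes does not disturb the enumeration.
            foreach (var node in unwanted.ToList())
            {
                node.Remove();
            }
        }

        // InnerText concatenates the remaining text nodes;
        // DeEntitize decodes entities such as &amp;.
        string text = HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);

        // Collapse runs of whitespace left behind by the removed markup.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

With the code from the question, this would be called as string text = HtmlTextExtractor.ExtractText(html); after the page has been downloaded.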

answered Nov 10 '16 at 17:39

GrandMasterFlush
