I am downloading web pages using the following code:
string html = String.Empty;
WebRequest request = WebRequest.Create(strURL);
// Dispose the response, stream and reader once the page has been read.
using (WebResponse response = request.GetResponse())
using (Stream data = response.GetResponseStream())
using (StreamReader sr = new StreamReader(data))
{
    html = sr.ReadToEnd();
}
Then I extract the body from the downloaded HTML as follows:
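// html is the string read above; compare case-insensitively since tags may be written as <BODY>.
int nBodyStart = html.IndexOf("<body", StringComparison.OrdinalIgnoreCase);
int nBodyEnd = html.LastIndexOf("</body>", StringComparison.OrdinalIgnoreCase);
// Fall back to the whole document if either tag is missing, instead of throwing in Substring.
String strBody = (nBodyStart >= 0 && nBodyEnd > nBodyStart)
    ? html.Substring(nBodyStart, nBodyEnd - nBodyStart + "</body>".Length) : html;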
Now I want to remove any JavaScript embedded in the body. How can I do that?
My aim is to get only the text content of the web page. Since each page may be structured differently, I am trying to remove any script tags first and then remove the remaining HTML tags using the regex below:
Regex.Replace(strBody, @"<[^>]+>| ", "").Trim();
But I don't know how to remove the JavaScript between script tags, since a script may span a single line or multiple lines.
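To make the goal concrete, here is a minimal sketch of the kind of thing I am after. It assumes the script blocks are well-formed and that no script contains the literal text "</script>" inside a string. ScriptStripper and RemoveScripts are hypothetical names of my own, not part of the code above, and this is only a sketch, not a verified solution. A real HTML parser would probably be more robust than a regex.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ScriptStripper
{
    // Hypothetical helper (assumption, not from the code above).
    // Singleline makes '.' match newlines, so multi-line scripts are covered.
    // The lazy '.*?' stops at the first closing tag rather than the last one.
    // IgnoreCase also handles <SCRIPT> ... </SCRIPT>.
    private static readonly Regex ScriptBlock = new Regex(
        @"<script\b[^>]*>.*?</script\s*>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns the input with every <script>...</script> block removed.
    public static string RemoveScripts(string html)
    {
        return ScriptBlock.Replace(html, string.Empty);
    }
}
```

It would be called on strBody before the tag-stripping regex, for example strBody = ScriptStripper.RemoveScripts(strBody);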
Thanks in advance.