
I am downloading web pages using the following code:

WebRequest request = WebRequest.Create(strURL);
WebResponse response = request.GetResponse();
Stream data = response.GetResponseStream();

string html = String.Empty;
using (StreamReader sr = new StreamReader(data))
{
  html = sr.ReadToEnd();
}

Then I extract the body part as follows:

// "html" is the string read from the response above.
int nBodyStart = html.IndexOf("<body");
int nBodyEnd = html.LastIndexOf("</body>");
string strBody = html.Substring(nBodyStart, nBodyEnd - nBodyStart + "</body>".Length);

Now I want to remove any JavaScript contained in the body part. How can I do that?

My aim is to get only the content of the web page. Since each page may be structured differently, I am trying to remove any script tags first and then remove the remaining HTML tags using the regex below:

Regex.Replace(strBody, @"<[^>]+>|&nbsp;", "").Trim();

But I don't know how to remove the JavaScript between script tags, since a script may span a single line or multiple lines.

Thanks in advance.

juan.facorro
Pratik Gaikwad
  • Any time regex parsing of HTML comes up, this post is always worth a read: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – charlietfl Dec 09 '13 at 05:22
  • @GrantWinney I tried using that. But even though my URL is only 242 characters long, it throws the following exception: The specified path, file name, or both are too long. The fully qualified file name must be less than 260 characters, and the directory name must be less than 248 characters. – Pratik Gaikwad Dec 09 '13 at 05:25

2 Answers


You can use HtmlAgilityPack:

WebRequest request = WebRequest.Create(strURL);
WebResponse response = request.GetResponse();
Stream data = response.GetResponseStream();

string html = String.Empty;
using (StreamReader sr = new StreamReader(data))
{
  html = sr.ReadToEnd();
}

HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
document.LoadHtml(html);

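// Remove script elements inside the body first; otherwise their
// contents would be included in InnerText. (Requires "using System.Linq;".)
var body = document.DocumentNode.SelectSingleNode("//body");
if (body != null)
{
    body.Descendants()
        .Where(n => n.Name == "script")
        .ToList() // copy first, so nodes are not removed while enumerating
        .ForEach(n => n.Remove());
}

// Then read the text with all remaining tags stripped.
var result = document.DocumentNode.InnerText;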
Damith
  • He wants to remove all tags (as in ``) as well and just keep the text. At least that's what I understood. – juan.facorro Dec 09 '13 at 05:23
  • I tried using that. But even though my URL is only 242 characters long, it throws the following exception: The specified path, file name, or both are too long. The fully qualified file name must be less than 260 characters, and the directory name must be less than 248 characters. – Pratik Gaikwad Dec 09 '13 at 05:26
  • @juan.facorro You are right. I want to remove all the tags. I just want to keep main data/content of the body. Not js functions, images or any extra things apart from content. – Pratik Gaikwad Dec 09 '13 at 05:28
  • @PratikGaikwad You already have the downloaded HTML as a string, so you can use that. If you also want to remove all the tags, use `document.DocumentNode.InnerText`. – Damith Dec 09 '13 at 05:32
  • @Damith: I tried that as well and got another exception: An unhandled exception of type 'System.ArgumentException' occurred in mscorlib.dll Additional information: Illegal characters in path. – Pratik Gaikwad Dec 09 '13 at 05:38

To match script tags (including the inside of the pair), use the following:

<script[^>]*>(.*?)</script>

To match all HTML tags (but not the inside of the pair) you can use:

</?[a-z][a-z0-9]*[^<>]*>


I just realised you might want to remove style tags too:

<style[^>]*>(.*?)</style>


Full regular expression string here:

<script[^>]*>(.*?)</script>|<style[^>]*>(.*?)</style>|</?[a-z][a-z0-9]*[^<>]*>|<[^>]+>|&nbsp;
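A minimal sketch of how these patterns might be applied in C#, assuming the page has already been downloaded into a string. The class name `HtmlTextExtractor` and the method name `StripMarkup` are hypothetical, chosen only for illustration. `RegexOptions.Singleline` makes the dot match newlines, so script and style blocks that span several lines are removed together with their contents, and `RegexOptions.IgnoreCase` also covers tags written as `<SCRIPT>`. This is a rough text-extraction approach and is not a substitute for a real HTML parser.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlTextExtractor
{
    // Matches a whole <script>...</script> or <style>...</style> block.
    // Singleline: '.' also matches '\n', so multi-line blocks are matched.
    // \1 refers back to the tag name captured by (script|style).
    private static readonly Regex ScriptOrStyle = new Regex(
        @"<(script|style)\b[^>]*>.*?</\1\s*>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // The tag-stripping pattern from the question: any remaining tag, or &nbsp;.
    private static readonly Regex TagOrNbsp = new Regex(@"<[^>]+>|&nbsp;");

    public static string StripMarkup(string html)
    {
        // Step 1: drop script/style blocks including everything between the tags.
        string withoutScripts = ScriptOrStyle.Replace(html, "");

        // Step 2: drop the remaining tags and &nbsp; entities, then trim.
        return TagOrNbsp.Replace(withoutScripts, "").Trim();
    }
}
```

With the variables from the question, this could be called as `string text = HtmlTextExtractor.StripMarkup(strBody);`. The script and style blocks have to be removed before the generic tag pattern runs, because otherwise only the tags themselves would be stripped and the JavaScript source would remain in the text.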

Pratik Gaikwad
Vasili Syrakis
  • The regex you provided removes only the tags, and only when they are on a single line. What if the start and end script tags are on different lines? I also want to remove the contents between them. – Pratik Gaikwad Dec 09 '13 at 05:35
  • If you are matching with javascript, you may have to specify `[\r\n]` with any 'dot' matches. If you are using C#, it might be the same, or you can specify that dot matches newline with `(?s)` at the beginning of the regular expression. – Vasili Syrakis Dec 09 '13 at 05:36
  • You got me closer, but it is not removing the contents between script tags. I don't want the contents between script tags either. And I am coding in C#. – Pratik Gaikwad Dec 09 '13 at 05:42
  • I'm not 100% sure how to do this in C#, but I have a feeling that dot does not match newline by default. In regex, you specify the mode with `(?s)` like this: `(?s)(?:<(?:script|style)[^>]*>(.*?)</(?:script|style)>|</?[a-z][a-z0-9]*[^<>]*>)`, but if there is some part of C# that overrides it, it will not work. For example, if the function which performs the match itself will only read single lines, you'll have to use a different function. – Vasili Syrakis Dec 09 '13 at 05:49
  • Thanks for all the help. I ended up using your old regex with small modifications. The final regex is as follows: '||?[a-z][a-z0-9]*[^<>]*>|<[^>]+>| ' – Pratik Gaikwad Dec 09 '13 at 06:18