I am doing string matching in ASP.NET with C#. I have to convert an HTML or .aspx page into plain text (the text a browser would display). The page contains <style> and <script> blocks, among other markup. I'm using the Regex.Replace method:

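// Remove <script> and <style> blocks together with their contents
str = Regex.Replace(str, @"<(script|style)\b[^>]*>.*?</\1>", "", RegexOptions.Singleline | RegexOptions.IgnoreCase);

// Extract the page title (non-greedy, so it stops at the first </title>)
string regex = @"(?<=<title[^>]*>)([\s\S]*?)(?=</title>)";
Regex ex = new Regex(regex, RegexOptions.IgnoreCase);
string title = ex.Match(str).Value.Trim();

// Remove the remaining HTML tags, including tags that span several lines
str = Regex.Replace(str, "<.*?>", "", RegexOptions.Singleline);
str = str.Replace("\r\n", " "); // a space keeps words on adjacent lines apart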
Richard
vasmay
  • Do you mean you have to display it rendered/parsed like a browser does? – Steven Ryssaert Mar 03 '11 at 09:56
  • I don't really follow your question; can you expand on it? You mention you are using Regex.Replace. Are you having problems with it? – Fishcake Mar 03 '11 at 09:57
  • Regex is the *wrong* tool for parsing HTML: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Richard Mar 03 '11 at 09:59
  • What problems are you facing? You need to expand your question; nobody will understand it if you don't supply the details. – coder_bro Mar 03 '11 at 10:01
  • I just have to convert the HTML into browser-viewable text, using regex. – vasmay Mar 03 '11 at 10:06
  • Lots of ideas [in this old question](http://stackoverflow.com/questions/731649/how-can-i-convert-html-to-text-in-c) – Rup Mar 03 '11 at 10:25

1 Answer

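You can't reliably use Regex to strip HTML: comments, attribute values that contain ">", and malformed markup all defeat simple patterns. You need an HTML parsing library. I've used the HTML Agility Pack successfully in the past.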

http://htmlagilitypack.codeplex.com/
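
For illustration, here is a minimal sketch of how the conversion might look with the HTML Agility Pack. It assumes the HtmlAgilityPack NuGet package is referenced; the class name `HtmlToText` and its methods are made up for this example and are not part of the library.

```csharp
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // third-party package, assumed to be installed

// Hypothetical helper class, named only for this sketch.
public static class HtmlToText
{
    // Returns the text content of the <body>, without scripts, styles, or comments.
    public static string Convert(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var hidden = doc.DocumentNode.SelectNodes("//script|//style|//comment()");
        if (hidden != null)
        {
            // Copy to a list so removal does not disturb the enumeration.
            foreach (var node in hidden.ToList())
            {
                node.Remove();
            }
        }

        // Fall back to the whole document if the page has no <body>.
        var root = doc.DocumentNode.SelectSingleNode("//body") ?? doc.DocumentNode;

        // Decode entities such as &amp; and collapse runs of whitespace.
        string text = HtmlEntity.DeEntitize(root.InnerText);
        return Regex.Replace(text, @"\s+", " ").Trim();
    }

    // Returns the trimmed <title> text, or an empty string if there is none.
    public static string GetTitle(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var titleNode = doc.DocumentNode.SelectSingleNode("//title");
        return titleNode == null ? "" : HtmlEntity.DeEntitize(titleNode.InnerText).Trim();
    }
}
```

You would call it as `string text = HtmlToText.Convert(str);` and `string title = HtmlToText.GetTitle(str);`. Note that `InnerText` simply concatenates text nodes, so adjacent block elements such as two `<p>` tags with no whitespace between them will run together. If you need line breaks where a browser would show them, you would have to walk the nodes yourself.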

System Down