1

I use the regex below to replace the text between two words. It works, except that it skips some occurrences. An example is pasted below.

var EditedHtml = Regex.Replace(htmlText, @"<script(.*?)</script>", ""); 

htmlText :

 <head>
   <script src=" https://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js" type="text/javascript"></script>
   <script src=" https://ajax.googleapis.com/ajax/libs/jqueryui/1.8.18/jquery-ui.min.js" type="text/javascript"></script>
   <script src="/AspellWeb/v2/js/dragiframe.js" type="text/javascript"></script>
   <script type="text/javascript">
     var applicationName = '/';
     FullPath = (applicationName.length > 1) ? 'http://localhost:65355' + applicationName : 'http://localhost:65355';
     //FullPath = 'http://localhost:65355';
     GetPath = function (url) {
     return FullPath + url;
   }
   </script>

   <script type="text/javascript" src="../../Scripts/stats.js?"></script>
</head>

<body>
  .......
  <script type="text/javascript">
    function loadAndInit() {

    $(".dvloading").hide();
    if ($.browser.mozilla) {
      if (location.pathname == "/Stats/Reports") {            // This is for local env.
        $("#prntCss").attr("href", "../../../Content/SitePrint_FF.css");
      }
      else {                                                  // This is for DEV/QA/STAGE/PROD env. 
        $("#prntCss").attr("href", "../../Content/SitePrint_FF.css");
      }
    }

  }
  </script>
</body>

EditedHtml :

<head>
  <script type="text/javascript">
    var applicationName = '/';
    FullPath = (applicationName.length > 1) ? 'http://localhost:65355' + applicationName : 'http://localhost:65355';
    //FullPath = 'http://localhost:65355';
    GetPath = function (url) {
      return FullPath + url;
    }
  </script>
</head>

<body>
  .......
  <script type="text/javascript">
    function loadAndInit() {

      $(".dvloading").hide();
      if ($.browser.mozilla) {
        if (location.pathname == "/Stats/Reports") {            // This is for local env.
          $("#prntCss").attr("href", "../../../Content/SitePrint_FF.css");
        }
        else {                                                  // This is for DEV/QA/STAGE/PROD env. 
          $("#prntCss").attr("href", "../../Content/SitePrint_FF.css");
        }
      }

    }
  </script>
</body>
Alan Moore
BumbleBee

4 Answers

4

Why do you use a regex to parse HTML? See this

It would be much easier to use a real HTML parser like HtmlAgilityPack.

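// Needs the HtmlAgilityPack package, plus: using System.IO; using System.Linq;
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.Load(filename); // Load expects a file path; for an HTML string use doc.LoadHtml(htmlString)

// Collect the <script> nodes first (ToList), then remove each one from the document
doc.DocumentNode.Descendants()
    .Where(n => n.Name == "script").ToList()
    .ForEach(s => s.Remove());

// Serialize the cleaned document back to a string
StringWriter wr = new StringWriter();
doc.Save(wr);
var newhtml = wr.ToString();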
I4V
  • `doc.Load` throws an "illegal characters in path" exception. It should be `doc.LoadHtml()`. – BumbleBee Apr 16 '13 at 23:12
  • 1
    @BumbleBee `doc.Load` requires a filename. If you want to load a string, you should use `doc.LoadHtml`, as I commented in the answer. – I4V Apr 16 '13 at 23:14
2

Try it in single line mode:

var EditedHtml = Regex.Replace(
    htmlText, @"<script(.*?)</script>", "", 
    RegexOptions.Singleline); 

Documentation quote:

Specifies single-line mode. Changes the meaning of the dot (.) so it matches every character (instead of every character except \n).
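
To make the difference concrete, here is a minimal sketch that runs the question's pattern with and without `RegexOptions.Singleline`. The class name `ScriptStripper` and its method names are made up for illustration; only `Regex.Replace` and `RegexOptions.Singleline` come from .NET.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class ScriptStripper
{
    // Same pattern as in the question; only the options differ.
    private const string Pattern = @"<script(.*?)</script>";

    // Default options: '.' does not match '\n', so a <script> block
    // that spans several lines is never matched and stays in the output.
    public static string StripDefault(string html) =>
        Regex.Replace(html, Pattern, "");

    // Singleline: '.' also matches '\n', so multi-line blocks are removed too.
    public static string StripSingleline(string html) =>
        Regex.Replace(html, Pattern, "", RegexOptions.Singleline);
}
```

For an input such as `"<script src=\"a.js\"></script>\n<script>\nvar x = 1;\n</script>\n<p>ok</p>"`, `StripDefault` removes only the first, one-line tag, while `StripSingleline` removes both and leaves `"\n\n<p>ok</p>"`. That matches what the question shows: the one-line tags are removed and the multi-line blocks are skipped.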

Blorgbeard
  • Why do people insist on parsing html with regex? Just a simple case ` test`. My browser shows "**test**" for this html. But your regex removes *test* from it – I4V Apr 16 '13 at 23:00
  • 2
    My regex? This is OP's regex. I'm not passing judgement on OP's choice of tool for his job, I'm just correcting his code. I agree that a proper parser would be better for robustness, but a quick and dirty regex is fine sometimes. Maybe the html follows a known format, maybe it's a one-off script. – Blorgbeard Apr 16 '13 at 23:16
2

Try

var EditedHtml = Regex.Replace(
    htmlText, @"<script(.*?)</script>", "", RegexOptions.Singleline
); 

Use singleline mode so the . matches any character including newlines.

MikeM
  • Why do people insist on parsing html with regex? Just a simple case ` test`. My browser shows "**test**" for this html. But your regex removes *test* from it – I4V Apr 16 '13 at 23:01
0

Try this:

//(.|\r\n)*: matches any character or a \r\n newline, zero or more times
//(.|\r\n)*?: the same, but as few times as possible ==> you get rid of the <script> tags and their content but keep the rest of your HTML
var EditedHtml = Regex.Replace(htmlText, @"<script (.|\r\n)*?</script>", ""); 

Hope it helps

References: http://msdn.microsoft.com/en-us/library/az24scfc.aspx

Andrea Scarcella
  • In .NET regexes, `.` matches every character except linefeed (`\n`), so you would only have to use `(.|\n)*?`. But it's easier and more efficient to use `.*?` and specify `Singleline` mode, as others have suggested. – Alan Moore Apr 16 '13 at 23:40
  • Thanks for your feedback. I must admit that it escapes me why using Singleline mode is more efficient; could you please clarify this point? – Andrea Scarcella Apr 17 '13 at 00:14
  • 1
    First, you have to enclose it in a group, so you've got the extra overhead of entering and leaving the group every time you consume a character. And you're using a *capturing* group, which adds even more overhead. Second, alternation itself tends to be less efficient than an equivalent character class. `(.|\n)` is so simple the regex engine can probably optimize it away, but a more complicated alternation can easily bring the engine to its knees, as [this answer](http://stackoverflow.com/a/2408599/20938) explains. – Alan Moore Apr 17 '13 at 08:12
  • 'Fraid I can't help you with that. You might try a .NET-specific discussion forum; I'm sure there are many of them out there. But this is definitely not the place for questions like that. – Alan Moore Apr 17 '13 at 08:42
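
As a rough illustration of the comments above, the sketch below applies the alternation form `(.|\n)*?` and the `Singleline` form `.*?` to the same input. It only shows that the two produce the same result; it does not measure performance. The class name `NewlinePatterns` and its method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class NewlinePatterns
{
    // Alternation inside a capturing group: '.' covers every character
    // except '\n', and the second branch covers '\n' itself.
    public static string WithAlternation(string html) =>
        Regex.Replace(html, @"<script(.|\n)*?</script>", "");

    // Plain dot with Singleline: no group and no alternation are needed.
    public static string WithSingleline(string html) =>
        Regex.Replace(html, @"<script.*?</script>", "", RegexOptions.Singleline);
}
```

Both methods return the same string for input that mixes `\r\n` and `\n` line endings, because in .NET the dot already matches `\r`.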