22

I am trying to remove JavaScript from the HTML.

I can't get the regular expression to work in PHP; it's giving me a null array. Why?

<?php
$var = '
<script type="text/javascript"> 
function selectCode(a) 
{ 
   var e = a.parentNode.parentNode.getElementsByTagName(PRE)[0]; 
   if (window.getSelection) 
   { 
      var s = window.getSelection(); 
       if (s.setBaseAndExtent) 
      { 
         s.setBaseAndExtent(e, 0, e, e.innerText.length - 1); 
      } 
      else 
      { 
         var r = document.createRange(); 
         r.selectNodeContents(e); 
         s.removeAllRanges(); 
         s.addRange(r); 
      } 
   } 
   else if (document.getSelection) 
   { 
      var s = document.getSelection(); 
      var r = document.createRange(); 
      r.selectNodeContents(e); 
      s.removeAllRanges(); 
      s.addRange(r); 
   } 
   else if (document.selection) 
   { 
      var r = document.body.createTextRange(); 
      r.moveToElementText(e); 
      r.select(); 
   } 
} 
</script>
';

   function remove_javascript($java){
   echo preg_replace('/<script\b[^>]*>(.*?)<\/script>/i', "", $java);

   }    
?>
Michael Currie
Saxtor
  • 3
    I think it is better to use a proper library to strip those – YOU Dec 11 '09 at 09:19
  • Not working; I'm getting the same thing – Saxtor Dec 11 '09 at 09:20
  • 1
    If you are trying to prevent XSS, I think you should read this page http://ha.ckers.org/xss.html before you try something useless. There are a lot of methods to inject scripts. – Arkh Dec 11 '09 at 10:50
  • 1
    @Arkh is absolutely right. I don't know if this was meant to provide some level of XSS safety, but it doesn't. Consider the trivial input `<scr<script></script>ipt>alert(1337)</scr<script></script>ipt>`. The pattern matches the inner empty script tag, but removing that leaves a new script tag intact. To say nothing of scripts in URLs, event handlers, CSS, etc. – Mike Samuel Nov 30 '10 at 13:16

8 Answers

66

This should do it:

echo preg_replace('/<script\b[^>]*>(.*?)<\/script>/is', "", $var);

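The s modifier (PCRE_DOTALL) makes the dot . match newlines too. Without it, .*? cannot cross the line breaks inside the script block, so the pattern never matches.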
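
To see the effect of the modifier in isolation, here is a minimal sketch. The sample string and the variable names are made up for illustration; only the two patterns come from the question and this answer.

```php
<?php
// Hypothetical sample input: a script block that spans several lines.
$html = "<p>before</p>\n<script type=\"text/javascript\">\nalert('hi');\n</script>\n<p>after</p>";

// Pattern from the question (no s modifier) and from this answer (with s).
$patternWithoutS = '/<script\b[^>]*>(.*?)<\/script>/i';
$patternWithS    = '/<script\b[^>]*>(.*?)<\/script>/is';

// Without s, "." stops at the first newline, so nothing is replaced.
$withoutS = preg_replace($patternWithoutS, '', $html);

// With s, "." also matches newlines and the whole block is removed.
$withS = preg_replace($patternWithS, '', $html);

var_dump($withoutS === $html); // true: the input came back unchanged
echo $withS, "\n";
```

The first call returns the input untouched, which matches the behaviour described in the question.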

Just a warning: you should not use this type of regexp to sanitize user input for a website. There are just too many ways to get around it. For sanitizing, use a library such as HTML Purifier (http://htmlpurifier.org/).
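
As a rough illustration of why the regexp is not a sanitizer, the sketch below feeds it a split-tag payload and an event-handler attribute. The helper name strip_scripts_until_stable and both sample strings are assumptions made for this example, not part of any library.

```php
<?php
$pattern = '/<script\b[^>]*>(.*?)<\/script>/is';

// Split-tag payload: removing the inner empty tags reassembles a real script tag.
$payload = '<scr<script></script>ipt>alert(1337)</scr<script></script>ipt>';
$afterOnePass = preg_replace($pattern, '', $payload);

// Hypothetical helper: repeat the replacement until the string stops changing.
function strip_scripts_until_stable(string $html, string $pattern): string
{
    do {
        $previous = $html;
        // preg_replace() returns null on error; fall back to the last good value.
        $html = preg_replace($pattern, '', $previous) ?? $previous;
    } while ($html !== $previous);
    return $html;
}

$afterLoop = strip_scripts_until_stable($payload, $pattern);

// Markup without any script tag can still run JavaScript via an event handler.
$handler = '<img src="x" onerror="alert(1)">';
$handlerAfter = strip_scripts_until_stable($handler, $pattern);

echo $afterOnePass, "\n"; // a working script tag survives a single pass
echo $handlerAfter, "\n"; // the onerror attribute is untouched
```

Looping until the output is stable closes the split-tag hole, but the event-handler case shows that script-tag removal alone still leaves executable markup behind.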

Tjofras
  • I think this does not cover the case mentioned before, which is exactly what someone who tried to bypass such a check would do. – Dimitrios Mistriotis Dec 11 '09 at 10:57
  • Will a browser really run something inside ``? I find that hard to believe... – gnud Dec 11 '09 at 21:51
  • 8
    I have changed/improved it slightly (especially to match any optional whitespace in the tag, which browsers would ignore, too): `$html = preg_replace('~<\s*\bscript\b[^>]*>(.*?)<\s*\/\s*script\s*>~is', '', $html);` – blueyed Oct 20 '10 at 10:41
  • hey gnude, when they update the database to strip out all comments, yes, yes it will... lol. :) – Shanimal Jun 14 '14 at 05:49
  • Don't forget the easy workaround ``. Removing only script tags will still leave you vulnerable to injected javascript – Kyborek Oct 30 '19 at 14:48
  • if your document has many ` – vanduc1102 Jul 16 '20 at 15:54
3

This might do more than you want, but depending on your situation you might want to look at strip_tags.
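
A small sketch of what strip_tags does with a script block; the sample string is invented for the example. Note that only the tags are dropped, while the text between them stays in the output.

```php
<?php
// Hypothetical sample input mixing markup and an inline script.
$html = '<p>Hello</p><script>alert(1)</script>';

// strip_tags() removes the tags themselves but keeps the text between them,
// so the JavaScript source survives as plain text.
$plain = strip_tags($html);

// The optional second argument lists tags that should be kept.
$keepParagraphs = strip_tags($html, '<p>');

echo $plain, "\n";
echo $keepParagraphs, "\n";
```

So if the script source itself must disappear, the script blocks have to be cut out before strip_tags is called.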

deceze
2

Here's an idea

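// Repeatedly cut out everything from "<script" through the next "</script>".
// The !== false check matters: a match at position 0 is falsy in PHP.
while (($beginning = stripos($var, "<script")) !== false) {
  $end = stripos($var, "</script>", $beginning);
  if ($end === false) {
    break; // unterminated script block: stop instead of looping forever
  }
  $stringLength = ($end + strlen("</script>")) - $beginning;
  // substr_replace() returns the new string; it does not modify $var in place
  $var = substr_replace($var, "", $beginning, $stringLength);
}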
Szymon Wygnański
bng44270
1
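function clean_jscode($script_str) {
    // Decode entities so that &lt;script&gt; is caught too (the result stays decoded)
    $script_str = htmlspecialchars_decode($script_str);
    // Normalize the case of the tags (<SCRIPT becomes <script) so explode() finds them
    $search_arr = array('<script', '</script>');
    $script_str = str_ireplace($search_arr, $search_arr, $script_str);
    $split_arr = explode('<script', $script_str);
    $remove_jscode_arr = array();
    foreach($split_arr as $key => $val) {
        // The first chunk precedes any script tag; later chunks keep only what follows </script>
        $newarr = explode('</script>', $val, 2);
        // ?? '' avoids an undefined-index notice when a script block is never closed
        $remove_jscode_arr[] = ($key == 0) ? $newarr[0] : ($newarr[1] ?? '');
    }
    return implode('', $remove_jscode_arr);
}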
Donald Duck
1

In your case, you could treat the string as a list of newline-delimited lines and remove the lines containing the script tags (the first and the second to last); you wouldn't even need regular expressions.

If you are trying to prevent XSS, though, removing only script tags might not be sufficient.
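
One possible reading of the line-based idea, as a sketch only: it assumes the lines between the opening and closing tags should go as well, and that each tag sits on a line of its own. The function name drop_script_lines and the sample input are hypothetical.

```php
<?php
// Hypothetical helper: drops every line from the one containing "<script"
// through the one containing "</script>", inclusive.
function drop_script_lines(string $html): string
{
    $kept = [];
    $inside = false;
    foreach (explode("\n", $html) as $line) {
        if (!$inside && stripos($line, '<script') !== false) {
            $inside = true; // this line opens a script block
        }
        if (!$inside) {
            $kept[] = $line;
        }
        if ($inside && stripos($line, '</script>') !== false) {
            $inside = false; // this line closes the block; it is dropped as well
        }
    }
    return implode("\n", $kept);
}

$sample = "<p>one</p>\n<script type=\"text/javascript\">\nalert(1);\n</script>\n<p>two</p>";
$cleaned = drop_script_lines($sample);
echo $cleaned, "\n";
```

Any other markup that shares a line with a script tag is lost with this approach, so it only suits input laid out like the string in the question.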

tosh
  • Well, thanks for the advice. However, what I am doing is creating a ripper, so that was needed in my class code. Thank you, guys! – Saxtor Dec 11 '09 at 10:09
1

You can remove JavaScript code from an HTML string with the help of the following PHP function.

You can read more about it here: https://mradeveloper.com/blog/remove-javascript-from-html-with-php

function sanitizeInput($inputP)
{
    $spaceDelimiter = "#BLANKSPACE#";
    $newLineDelimiter = "#NEWLNE#";
                                
    $inputArray = [];
    $minifiedSanitized = '';
    $unMinifiedSanitized = '';
    $sanitizedInput = [];
    $returnData = [];
    $returnType = "string";

    if($inputP === null) return null;
    if($inputP === false) return false;
    if(is_array($inputP) && sizeof($inputP) <= 0) return [];

    if(is_array($inputP))
    {
        $inputArray = $inputP;
        $returnType = "array";
    }
    else
    {
        $inputArray[] = $inputP;
        $returnType = "string";
    }

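    foreach($inputArray as $input)
    {
        // Replace spaces and newlines with placeholders so that "." in the
        // pattern below can span what were originally several lines
        $minified = str_replace(" ",$spaceDelimiter,$input);
        $minified = str_replace("\n",$newLineDelimiter,$minified);

        //removing <script> tags; the lazy .*? stops at the first closing tag,
        //so markup between two separate script blocks is kept
        $minifiedSanitized = preg_replace("/<script\b[^>]*>.*?<\/script[^>]*>/i","",$minified);

        $unMinifiedSanitized = str_replace($spaceDelimiter," ",$minifiedSanitized);
        $unMinifiedSanitized = str_replace($newLineDelimiter,"\n",$unMinifiedSanitized);

        //removing inline js events (double-quoted, single-quoted, then unquoted values);
        //each value is matched only up to its own closing quote or the next space
        $unMinifiedSanitized = preg_replace("/([ ]on[a-zA-Z0-9_-]{1,}=\"[^\"]*\")|([ ]on[a-zA-Z0-9_-]{1,}='[^']*')|([ ]on[a-zA-Z0-9_-]{1,}=[^\s>\"']+)/i","",$unMinifiedSanitized);

        //removing inline js (href attributes whose value starts with javascript:)
        $unMinifiedSanitized = preg_replace("/([ ]href\s*=\s*\"\s*javascript:[^\"]*\")|([ ]href\s*=\s*'\s*javascript:[^']*')|([ ]href\s*=\s*javascript:[^\s>]*)/i","",$unMinifiedSanitized);

        $sanitizedInput[] = $unMinifiedSanitized;
    }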

    if($returnType == "string" && sizeof($sanitizedInput) > 0)
    {
        $returnData = $sanitizedInput[0];
    }
    else
    {
        $returnData = $sanitizedInput;
    }
                                
    return $returnData;
}
           
Ali Haider
0

This was very useful for me. Try this code:

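while (($pos = stripos($content, "<script")) !== false) {
    // Look for the closing tag only after the opening one
    $end_pos = stripos($content, "</script>", $pos);
    if ($end_pos === false) {
        // Unterminated script block: drop everything from the opening tag on
        $content = substr($content, 0, $pos);
        break;
    }
    $start = substr($content, 0, $pos);
    $end = substr($content, $end_pos + strlen("</script>"));
    $content = $start . $end;
}
// Remove the remaining tags once the script bodies are gone
$text = strip_tags($content);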
Pejman Kheyri
-1

I use this:

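function clear_text($s) {
    while (true) {
        $start = stripos($s, '<script');
        // Search for the closing tag only after the opening tag, so a stray
        // </script> earlier in the string cannot confuse the cut
        $stop = ($start === false) ? false : stripos($s, '</script>', $start);
        if ($start === false || $stop === false) {
            break;
        }
        $s = substr($s, 0, $start) . substr($s, $stop + strlen('</script>'));
    }
    return trim($s);
}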
Adam Trhon
Tamás Pap