74

I am using HTML Purifier (http://htmlpurifier.org/)

I want to remove only <script> tags. I don't want to remove inline formatting or anything else.

How can I achieve this?

One more thing: is there any other way to remove script tags from HTML?

I-M-JM
  • 3
    Keep in mind that script tags are not the only vulnerable parts of HTML. – Karolis Aug 20 '11 at 09:22
  • Yes, I know about other vulnerable parts too, but I just need to remove script tags – I-M-JM Aug 20 '11 at 09:24
  • 3
    Read [this](http://www.pagecolumn.com/tool/all_about_html_tags.htm). It will help you – Jose Adrian Aug 20 '11 at 09:28
  • 4
    @Jose hell no. read this http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 no regex for parsing html – Madara's Ghost Aug 20 '11 at 09:47
  • This question was already asked many times e.g. [here](http://stackoverflow.com/questions/116403/im-looking-for-a-regular-expression-to-remove-a-given-xhtml-tag-from-a-string/116488#116488) or [here](http://stackoverflow.com/questions/226562/how-can-i-remove-an-entire-html-tag-and-its-contents-by-its-class-using-a-regex/226591#226591), but beware of [that](http://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-rege/702222#702222). – dma_k Aug 20 '11 at 10:00
  • 1
    @Rikudo Well... if he needs to use regexp to remove html tags... there should be a reason. Thanks for that link! – Jose Adrian Aug 20 '11 at 10:03
  • @Jose the reason is not being familiar with other, better tools. It's the exact same reason people are still using `mysql_*` functions in PHP. – Madara's Ghost Aug 20 '11 at 10:06
  • @Rikudo Sennin -- or PHP at all. :) – Michael Lorton Aug 20 '11 at 10:07
  • @Malvolio nahhh, that's going a bit too far now :P – Madara's Ghost Aug 20 '11 at 10:08
  • @Rikudo Using regex for HTML parsing has its own advantages and disadvantages. Its usefulness depends on the particular situation. Don't be so fanatical. The world is much more complex and the same rule can't be used for all purposes. Yes, in many cases regex is not the best tool for HTML parsing, but this doesn't mean anything. – Karolis Aug 20 '11 at 10:16
  • Obviously, however, in most cases, it's very inefficient and insecure to use a regex. It's very problematic to use a parser that **does not understand** the language it's parsing. That's why there are **specific** HTML and XML parsers. – Madara's Ghost Aug 20 '11 at 10:18
  • @Rikudo You are trying to use one rule for everything :) Later you'll see that not everything is so simple. – Karolis Aug 20 '11 at 10:25
  • Regarding the html parser vs. regex debate - you probably need both; be aware that an html parser will not recognize conditional comments which means that IE will happily render script tags therein. The general problem with solving this in an elegant way is that the browsers don't care... – jgivoni Jan 18 '13 at 15:20

13 Answers

162

Because this question is tagged with regex, I'm going to answer with a poor man's solution for this situation:

$html = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $html);

However, regular expressions are not meant for parsing HTML/XML. Even if you write the perfect expression, it will break eventually, so it's not worth it. In some cases a regex is useful for quickly fixing some markup, but as with any quick fix, forget about security. Use regex only on content/markup you trust.

Remember, anything the user inputs should be considered unsafe.
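
To make the caveat concrete, here is a small sketch (not a hardened sanitizer) of how a single pass of the pattern above can be defeated by a nested tag, together with the repeat-until-stable workaround suggested in one of the comments. The function name `stripScriptsUntilStable` and the sample input are made up for illustration.

```php
<?php

// Hypothetical helper: reapply the answer's pattern until nothing changes.
function stripScriptsUntilStable(string $html): string
{
    $pattern = '#<script(.*?)>(.*?)</script>#is';

    do {
        $previous = $html;
        $html = preg_replace($pattern, '', $previous);

        if ($html === null) {
            // preg_replace() returns null on a PCRE error (e.g. backtrack limit).
            throw new RuntimeException('preg_replace failed: ' . preg_last_error());
        }
    } while ($html !== $previous);

    return $html;
}

// Made-up input: removing the inner tags splices the outer ones together.
$nested = '<scr<script></script>ipt>alert(1)</scr<script></script>ipt>';

// A single pass leaves '<script>alert(1)</script>' behind.
$onePass = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $nested);

// Repeating until stable removes the spliced tag as well, leaving ''.
$stable = stripScriptsUntilStable($nested);
```

Even the repeated version only handles this particular trick; it is still a pattern match, not a parser.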

A better solution here is to use DOMDocument, which is designed for this. Here is a snippet that demonstrates how easy, clean (compared to regex), (almost) reliable and (nearly) safe it is to do the same:

<?php

$html = <<<HTML
...
HTML;

$dom = new DOMDocument();

$dom->loadHTML($html);

$script = $dom->getElementsByTagName('script');

$remove = [];
foreach($script as $item)
{
  $remove[] = $item;
}

foreach ($remove as $item)
{
  $item->parentNode->removeChild($item); 
}

$html = $dom->saveHTML();

I have removed the HTML intentionally because even this can bork.
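
For input that is not perfectly valid HTML 4 (custom or HTML5 elements, for example), loadHTML() may emit warnings. Below is a minimal sketch of the same removal with libxml's internal error buffering switched on, assuming the DOM extension is available. The function name `removeScriptsQuietly` and the sample markup are hypothetical.

```php
<?php

// Hypothetical wrapper around the DOMDocument approach shown above.
function removeScriptsQuietly(string $html): string
{
    // Collect parser complaints internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // Copy the live node list to an array before removing anything.
    foreach (iterator_to_array($dom->getElementsByTagName('script')) as $node) {
        $node->parentNode->removeChild($node);
    }

    // Discard the buffered errors and restore the caller's setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom->saveHTML();
}

// Made-up input containing an element the parser may not know.
$clean = removeScriptsQuietly('<p>ok</p><mytag>x</mytag><script>alert(1)</script>');
```

The result is still a full document (doctype, html and body wrappers included), exactly as with the snippet above.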

Dejan Marjanović
  • 13
    -1 for RegExp solution. See [this discussion](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). – Alex Aug 20 '11 at 10:20
  • 55
    I saw that discussion long time ago, you should read it, not just see it. – Dejan Marjanović Aug 20 '11 at 10:23
  • 12
    While I appreciate your aloof response, my reasoning for disapproving your answer is sound. See [this gist](https://gist.github.com/42e3f06274d5df814cf1) for a crafted script tag which circumvents your regex. In fairness, it is arguably more of shortcoming of your particular regular expression than a reason to abandon regex altogether. But, interesting to me all the same. – Alex Dec 07 '11 at 23:53
  • 2
    This particular regex is _vulnerable_ to javascript injection. – jmlnik Mar 31 '12 at 05:01
  • 1
    @ParijatKalia it's a stupid idea to display remote HTML with or without script anyway, so what difference does it make? If you are absolutely sure about the content, I doubt you'll run into HTML like the one you've written. Btw, I answered with regex only because the question was tagged like so. – Dejan Marjanović Apr 23 '13 at 19:59
  • 5
    If you want to take the regex route, make sure you run `preg_replace` multiple times until the output doesn't change anymore (catches example input from @ParijatKalia). – Mark Aug 22 '13 at 12:22
  • 1
    Just out of interest why do you have two `foreach`loops? Why not just `foreach($scripts as $script){$script->parentNode->removeChild($script);}`? – Arth Dec 16 '14 at 16:56
  • 3
    @Arth because you will not get correct results (iterator doesn't behave like it's expected), see [this](http://php.net/manual/en/domnode.removechild.php#90292) comment. – Dejan Marjanović Dec 16 '14 at 21:20
  • @webarto Thanks for your reply, particularly the ref! – Arth Dec 17 '14 at 10:55
  • What is the #is for in the regex? – Arnold Roa Dec 18 '14 at 18:23
  • 1
    For sake of argument. Sometimes it IS necessary to use regex to strip tags from content. Sure, we all know this is bad but sometimes you HAVE to use regex. The DOMDocument will not work unless it is HTML. But let's say you are importing content from Drupal to WordPress... DOMDocument will not work as this is not true HTML in the content but just text with markup in it. This is when you HAVE to use regex as you want to keep most tags but remove script tags as they shouldn't be there anyways. So sure, use DOMDocument if you can but to say you shouldn't use regex to do this is just ignorant. – Jeremy Feb 09 '15 at 19:27
  • You regexp haters are acting like DOMDocument is safer. It's not. – jchook Mar 17 '16 at 03:28
  • 1
    how do you get the DOMDocument parser to not add the Doctype, HTML and BODY tags? – Mike Jun 17 '16 at 20:02
  • Thanks for the answer, but I second Mike's comment above. If I'm working with an HTML snippet, I wouldn't appreciate having other stuff added around it like saveHTML apparently does. – DrLightman Nov 03 '16 at 15:24
  • In the regex solution I think you should escape `/` in `</script>` – Kyborek Nov 25 '16 at 09:10
  • To avoid adding DOCTYPE, html and body tags, see [this answer](https://stackoverflow.com/a/31426408/3832970). – Wiktor Stribiżew Oct 30 '17 at 07:30
  • `'~~is'` – Эдуард Mar 26 '18 at 14:45
  • 1
    Note that this breaks DOMDocument parsing when using loadHTML() because of the HTML markup in a Javascript string: ```
    ```
    – Matthew Kolb Sep 28 '18 at 18:57
  • saveHtml() will add extra unnecessary html to the string ie:

    for more info see https://3v4l.org/1TNHP

    – relipse Jan 02 '20 at 21:20
  • What about `` uppercased or mixed tags? – Zsolt Janes Apr 08 '20 at 23:54
  • The DOMDocument solution does not work for me, it puts the

    inside the

    tag, thus messing up the whole html

    –  Jun 17 '20 at 11:17
44

Use the PHP DOMDocument parser.

$doc = new DOMDocument();

// load the HTML string we want to strip
$doc->loadHTML($html);

// get all the script tags
$script_tags = $doc->getElementsByTagName('script');

// the node list is live, so remove from the end to avoid skipping items
for ($i = $script_tags->length - 1; $i >= 0; $i--) {
  $tag = $script_tags->item($i);
  $tag->parentNode->removeChild($tag);
}

// get the HTML string back
$no_script_html_string = $doc->saveHTML();

This worked for me using the following HTML document:

<!doctype html>
<html>
    <head>
        <meta charset="utf-8">
        <title>
            hey
        </title>
        <script>
            alert("hello");
        </script>
    </head>
    <body>
        hey
    </body>
</html>

Just bear in mind that the DOMDocument parser requires PHP 5 or greater.

Alex
  • 5
    +0 I'm sick of hearing about that discussion regarding regex and HTML. In _some_ very special occasions it should be OK to use regex. In my case, I'm getting this error: `Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Tag myCustomTag invalid in Entity`. Tried everything. All I want to do is remove script tags for one tiny part of the application (_without_ spending any more time on it). I'm going to use preg_replace and that is that. I don't wanna hear another word about it. :) – Yes Barry Dec 06 '11 at 21:39
  • 2
    See my comment to the chosen best answer. I would prefer to see coders cover general cases, as malicious users can get very clever. However, you are right: in developing an internal application, for instance, it could be considered OK to ignore such vulnerabilities and use regex. – Alex Dec 07 '11 at 23:55
  • @Xeoncross Thanks! I'll give that a try next time I get a chance to work on this. At the moment I'm busy with other code and don't wanna have to dig that stuff up :). – Yes Barry Feb 10 '12 at 03:22
  • 1
    DOMDocument and SimpleXML can be used to load files outside of your document root. Use libxml_disable_entity_loader(true) to disable this feature of libxml. http://www.php.net/manual/en/function.libxml-disable-entity-loader.php – txyoji Jul 19 '12 at 20:19
  • this code will give `'Fatal error: Call to a member function removeChild() on null'` once you have an empty tag, like `` – yumba Jul 01 '15 at 13:36
  • @Spi Interesting. Do you know how to amend the code to fix that? – Alex Jul 02 '15 at 11:30
  • @SPi I kept getting the same errors. This worked for me (still, used yours as a base, so thanks...): `// load HTML $dom = new DOMDocument; $dom->loadHTML($html_to_parse); // remove all scripts while (true) { $script = $dom->getElementsByTagName('script')->item(0); if ($script != NULL) { $script->parentNode->removeChild($script); } else { break; } }` – Paul Sep 07 '16 at 14:23
  • Note that this breaks DOMDocument parsing when using loadHTML() because of the HTML markup in a Javascript string: ```
    ```
    – Matthew Kolb Sep 28 '18 at 18:57
  • Thanks for the update @MatthewKolb. Shame it doesn't work anymore (what PHP version are you using?); do you know if there's something more appropriate? – Alex Sep 30 '18 at 09:09
  • @Alex I'm using php 5.6.35. Your example still works great - as long as the JS does not include HTML tags. I've read loadXML() would better be able to handle this type of case, but it appears it just fails to load the DOM at all since it considers the input to be invalid XML. I haven't found a better solution than to use REGEX to strip scripts before loading into DOMDocument – Matthew Kolb Oct 01 '18 at 14:17
7
$html = <<<HTML
...
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$tags_to_remove = array('script','style','iframe','link');
foreach($tags_to_remove as $tag){
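    // copy the live node list to an array first; removing nodes while
    // iterating the list itself would skip elements
    $elements = iterator_to_array($dom->getElementsByTagName($tag));
    foreach($elements as $item){
        $item->parentNode->removeChild($item);
    }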
}
$html = $dom->saveHTML();
prasanthnv
  • I upvoted this response because for one thing it's clean and simple, and it also reminded me that iframes could also cause me trouble. – soger Dec 06 '18 at 14:44
  • 1
    Also, I just realized, this adds doctype, html and body tags, which is okay for the current question, but was not okay for me, but I only had to change one line (as the top comment says on the saveHTML php.net page): `$dom->loadHTML($html,LIBXML_HTML_NOIMPLIED|LIBXML_HTML_NODEFDTD);` – soger Dec 06 '18 at 14:59
4

A simple way, using plain string manipulation:

function stripStr($str, $ini, $fin)
{
    while (($pos = mb_stripos($str, $ini)) !== false) {
        $aux = mb_substr($str, $pos + mb_strlen($ini));
        $str = mb_substr($str, 0, $pos);
        
        if (($pos2 = mb_stripos($aux, $fin)) !== false) {
            $str .= mb_substr($aux, $pos2 + mb_strlen($fin));
        }
    }

    return $str;
}
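
A usage sketch for the approach above. To keep the block self-contained it restates the same logic under the hypothetical name `stripBetween`, using the byte-oriented `stripos`/`substr` so that it runs without the mbstring extension. The sample strings are made up.

```php
<?php

// Restatement of the function above with byte-oriented string functions.
function stripBetween(string $str, string $ini, string $fin): string
{
    while (($pos = stripos($str, $ini)) !== false) {
        $aux = substr($str, $pos + strlen($ini)); // text after the opening marker
        $str = substr($str, 0, $pos);             // text before the opening marker

        if (($pos2 = stripos($aux, $fin)) !== false) {
            // Closing marker found: keep whatever follows it.
            $str .= substr($aux, $pos2 + strlen($fin));
        }
        // No closing marker: everything after the opening marker is dropped.
    }

    return $str;
}

// Searching for '<script' (without '>') also catches tags with attributes.
$closed = stripBetween(
    '<p>a</p><SCRIPT src="x.js">alert(1)</script><b>c</b>',
    '<script',
    '</script>'
);

// An unclosed tag cuts everything from the opening marker to the end.
$unclosed = stripBetween('<p>a</p><script>alert(1)', '<script', '</script>');
```

Passing the opening marker without its closing bracket is a design choice of this sketch; it also matches any other tag whose name merely starts with "script".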
José Carlos PHP
  • @Someone_who_likes_SE Yes, sure. You can use stripos and substr instead of mb_stripos and mb_substr, but I prefer to use MB functions, they are more reliable. – José Carlos PHP Jul 22 '21 at 23:59
  • This is all fine, but there is a serious flaw here. Mind you, you do not know which input you have. If $fin not in $str (or $aux), you have a perfect loop here. Happy debugging! There are several options to tweak this code to cope for that flaw. I'll leave it to you to fix it. – kklepper Aug 16 '21 at 19:29
  • @kklepper I have modified it, now if $fin is not found, it cuts from $ini to the end of the string. Regards! – José Carlos PHP Aug 18 '21 at 08:46
3

Shorter:

$html = preg_replace("/<script.*?\/script>/s", "", $html);

When doing regex, things might go wrong (preg_replace returns null on error), so it's safer to do it like this (PHP 7+):

$html = preg_replace("/<script.*?\/script>/s", "", $html) ?? $html;

That way, when the "accident" happens, we get the original $html back instead of null.
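
One subtlety with this kind of guard, sketched below with made-up input: preg_replace() signals failure with null, not with an empty string, and an empty string is a perfectly valid result when the input consists of nothing but a script. A truthiness test (`?:`) treats that empty result as a failure and hands back the unfiltered input, whereas a null test (`??`, PHP 7+) does not.

```php
<?php

$pattern = "/<script.*?\/script>/s";
$html    = '<script>alert(1)</script>'; // made-up input: nothing but a script

// The whole input matches, so the result is '' (null would mean an error).
$replaced = preg_replace($pattern, '', $html);

// Truthiness fallback: '' is falsy, so the original markup comes back.
$viaElvis = $replaced ?: $html;

// Null fallback: only a real preg error falls back to the original.
$viaCoalesce = $replaced ?? $html;
```

Whether falling back to the unfiltered input on error is acceptable at all depends on the application; throwing an exception is the stricter alternative.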

Binh WPO
3
This is a merge of the answers by ClandestineCoder and Binh WPO.

The problem with the angle brackets of the script tag is that they can appear in more than one variant,

e.g. (< = &lt; = &amp;lt;) and (> = &gt; = &amp;gt;)

so instead of creating a pattern array with a bazillion variants, IMHO a better solution would be:

return preg_replace('/script.*?\/script/ius', '', $text)
       ? preg_replace('/script.*?\/script/ius', '', $text)
       : $text;

This will remove anything that looks like script.../script regardless of how the brackets are encoded, and you can test it here: https://regex101.com/r/lK6vS8/1

ctf0
3

Try this complete and flexible solution. It works perfectly, and is based in part on some previous answers, but it adds validation checks and gets rid of the extra implied HTML that the loadHTML(...) function adds. It is split into two functions (the second depends on the first, so don't reorder them), so you can use it to remove several HTML tags at once, not just 'script' tags: the removeAllInstancesOfTag(...) function accepts an array of tag names, or just one name as a string. Here is the code:


/* Remove all instances of a particular HTML tag (e.g. <script>...</script>) from a variable containing raw HTML data. [BEGIN] */

/* Usage Example: $scriptless_html = removeAllInstancesOfTag($html, 'script'); */

if (!function_exists('removeAllInstancesOfTag'))
    {
        function removeAllInstancesOfTag($html, $tag_nm)
            {
                if (!empty($html))
                    {
                        $html = mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'); /* For UTF-8 Compatibility. */
                        $doc = new DOMDocument();
                        $doc->loadHTML($html,LIBXML_HTML_NOIMPLIED|LIBXML_HTML_NODEFDTD|LIBXML_NOWARNING);

                        if (!empty($tag_nm))
                            {
                                if (is_array($tag_nm))
                                    {
                                        $tag_nms = $tag_nm;
                                        unset($tag_nm);

                                        foreach ($tag_nms as $tag_nm)
                                            {
                                                $rmvbl_itms = $doc->getElementsByTagName(strval($tag_nm));
                                                $rmvbl_itms_arr = [];

                                                foreach ($rmvbl_itms as $itm)
                                                    {
                                                        $rmvbl_itms_arr[] = $itm;
                                                    }

                                                foreach ($rmvbl_itms_arr as $itm)
                                                    {
                                                        $itm->parentNode->removeChild($itm);
                                                    }
                                            }
                                    }
                                else if (is_string($tag_nm))
                                    {
                                        $rmvbl_itms = $doc->getElementsByTagName($tag_nm);
                                        $rmvbl_itms_arr = [];

                                        foreach ($rmvbl_itms as $itm)
                                            {
                                                $rmvbl_itms_arr[] = $itm;
                                            }

                                        foreach ($rmvbl_itms_arr as $itm)
                                            {
                                                $itm->parentNode->removeChild($itm); 
                                            }
                                    }
                            }

                        return $doc->saveHTML();
                    }
                else
                    {
                        return '';
                    }
            }
    }

/* Remove all instances of a particular HTML tag (e.g. <script>...</script>) from a variable containing raw HTML data. [END] */

/* Remove all instances of dangerous and pesky <script> tags from a variable containing raw user-input HTML data. [BEGIN] */

/* Prerequisites: 'removeAllInstancesOfTag(...)' */

if (!function_exists('removeAllScriptTags'))
    {
        function removeAllScriptTags($html)
            {
                return removeAllInstancesOfTag($html, 'script');
            }
    }

/* Remove all instances of dangerous and pesky <script> tags from a variable containing raw user-input HTML data. [END] */


And here is a test usage example:


$html = 'This is a JavaScript retention test.<br><br><span id="chk_frst_scrpt">Congratulations! The first \'script\' tag was successfully removed!</span><br><br><span id="chk_secd_scrpt">Congratulations! The second \'script\' tag was successfully removed!</span><script>document.getElementById("chk_frst_scrpt").innerHTML = "Oops! The first \'script\' tag was NOT removed!";</script><script>document.getElementById("chk_secd_scrpt").innerHTML = "Oops! The second \'script\' tag was NOT removed!";</script>';
echo removeAllScriptTags($html);

I hope my answer really helps someone. Enjoy!

NoobishPro
James Anderson Jr.
2

An example modifying ctf0's answer. It runs preg_replace only once, checks for errors, and also catches the character codes for the forward slash.

$str = '<script> var a - 1; <&#47;script>'; 

$pattern = '/(script.*?(?:\/|&#47;|&#x0002F;)script)/ius';
$replace = preg_replace($pattern, '', $str); 
return ($replace !== null)? $replace : $str;  

If you are using PHP 7, you can use the null coalescing operator to simplify it even more.

$pattern = '/(script.*?(?:\/|&#47;|&#x0002F;)script)/ius'; 
return (preg_replace($pattern, '', $str) ?? $str); 
tech-e
  • This does have one downfall: if someone uses files from a script folder in the HTML like: .. . This will create a catch that will delete everything in between them. – tech-e Mar 24 '17 at 18:05
2
function remove_script_tags($html){
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $script = $dom->getElementsByTagName('script');

    $remove = [];
    foreach($script as $item){
        $remove[] = $item;
    }

    foreach ($remove as $item){
        $item->parentNode->removeChild($item);
    }

    $html = $dom->saveHTML();
    $html = preg_replace('/<!DOCTYPE.*?<html>.*?<body><p>/ims', '', $html);
    $html = str_replace('</p></body></html>', '', $html);
    return $html;
}

Dejan's answer was good, but saveHTML() adds unnecessary doctype and body tags; this should get rid of them. See https://3v4l.org/82FNP
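
An alternative to trimming the wrapper with string functions, sketched under the assumption that PHP 7+ and a libxml version supporting the `LIBXML_HTML_NOIMPLIED` and `LIBXML_HTML_NODEFDTD` flags (the ones mentioned in an earlier comment) are available: wrap the fragment in a temporary `<div>`, then serialize only that element's children. The function name `removeScriptsFromFragment` and the wrapper trick are illustrative and not part of the original answer.

```php
<?php

// Hypothetical helper: strip <script> elements from an HTML fragment
// without getting a doctype, <html> or <body> wrapper back.
function removeScriptsFromFragment(string $fragment): string
{
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    // The temporary <div> gives the parser a single root element.
    $dom->loadHTML(
        '<div>' . $fragment . '</div>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy the live node list to an array before removing anything.
    foreach (iterator_to_array($dom->getElementsByTagName('script')) as $node) {
        $node->parentNode->removeChild($node);
    }

    // Serialize the children of the wrapper, not the wrapper itself.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }

    return $out;
}

$clean = removeScriptsFromFragment('<p>Hello</p><script>alert(1)</script><b>x</b>');
```

This sketch does nothing about character encoding, so non-ASCII input needs the same care as in the other DOMDocument answers, and a fragment containing a stray closing `</div>` would end the wrapper early.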

relipse
1

I had been struggling with this question. I discovered you only really need one function: explode('>', $html). The single common denominator of any tag is < and >. After that it's usually quotation marks ( " ). You can extract information very easily once you find the common denominator. This is what I came up with:

$html = file_get_contents('http://some_page.html');

$h = explode('>', $html);

foreach($h as $k => $v){

    $v = trim($v);//clean it up a bit

    if(preg_match('/^(<script[.*]*)/ius', $v)){//my regex here might be questionable

        $counter = $k;//match opening tag and start counter for backtrace

        }elseif(preg_match('/([.*]*<\/script$)/ius', $v)){//but it gets the job done

            $script_length = $k - $counter;

            $counter = 0;

            for($i = $script_length; $i >= 0; $i--){
                $h[$k-$i] = '';//backtrace and clear everything in between
                }
            }           
        }
$ht = array();//collect the non-empty pieces here
for($i = 0; $i < count($h); $i++){
    if($h[$i] != ''){
    $ht[$i] = $h[$i];//clean out the blanks so when we implode it works right.
        }
    }
$html = implode('>', $ht);//all scripts stripped.


echo $html;

I see this really only working for script tags because you will never have nested script tags. Of course, you can easily add more code that does the same check and gather nested tags.

I call it accordion coding. implode();explode(); are the easiest ways to get your logic flowing if you have a common denominator.

  • You should not use Regex for finding script tags in the HTML code. Use DOMDocument to parse the entire document and find the script tags to remove – Nicola Revelant Sep 02 '21 at 08:27
1

This is a simplified variant of Dejan Marjanovic's answer:

function removeTags($html, $tag) {
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    foreach (iterator_to_array($dom->getElementsByTagName($tag)) as $item) {
        $item->parentNode->removeChild($item);
    }
    return $dom->saveHTML();
}

Can be used to remove any kind of tag, including <script>:

$scriptlessHtml = removeTags($html, 'script');
mae
1

Use the str_replace function to replace them with an empty string or something similar:

$query = '<script>console.log("I should be banned")</script>';

$badChar = array('<script>','</script>');
$query = str_replace($badChar, '', $query);

echo $query; 
//this echoes console.log("I should be banned")

?>
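
A short sketch of the limits of this approach, using made-up inputs: the exact-match search misses upper-case tags and tags with attributes, and removing an inner match can splice a new tag together.

```php
<?php

$badChar = array('<script>', '</script>');

// Upper-case tags are not matched at all; the string comes back unchanged.
$upper = str_replace($badChar, '', '<SCRIPT>alert(1)</SCRIPT>');

// An opening tag with attributes is not matched; only the closing tag is
// removed, leaving '<script type="text/javascript">alert(1)'.
$withAttr = str_replace($badChar, '', '<script type="text/javascript">alert(1)</script>');

// Removing the inner tags leaves '<script>alert(1)</script>' behind.
$nested = str_replace($badChar, '', '<scr<script>ipt>alert(1)</scr</script>ipt>');
```

Switching to str_ireplace, as one of the comments suggests, would address only the first of these three cases.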

Oliver Tembo
  • I don't know why people keep arguing over DOMDocument and some kind of regex as the "solution" vs "not the solution". I like this guy's answer -- to simply use php's str_replace (but I'd use str_ireplace due to case-insensitivity). Unless you have a ton of stuff you want to remove, this seems to be the simplest and most effective solution. I tell my users that can't paste or type that kind of stuff. If they do, then tough luck -- it will be removed. – McAuley Oct 27 '18 at 20:16
  • 2
    This solution keeps javascript code inside the html string. This is a joke, not a good solution! However, you can go far and remove from " – José Carlos PHP Oct 31 '18 at 12:06
  • **i**replace " – NeoTechni May 28 '21 at 22:05
1

I would use BeautifulSoup if it's available. Makes this sort of thing very easy.

Don't try to do it with regexps. That way lies madness.

Michael Lorton