
I'm building a forum-like web app. Users are allowed to submit rich HTML text to the server, such as p tags, div tags, etc. To preserve the formatting, the server writes these tags back to users' browsers directly (without HTML-encoding them). So I must check for potentially dangerous scripts to avoid XSS. Any JavaScript code is considered dangerous and is not allowed. How can I detect it, or is there a better solution?

dangerous example 1:

<script>alert('1')</script>

dangerous example 2:

<script src="..."></script>

dangerous example 3:

<a href="javascript:dangerousFunction();">click me</a>
guogangj
  • http://stackoverflow.com/a/21729561/7106750 This might help you @guogangj –  Dec 29 '16 at 03:01
  • Maybe check out [this](http://stackoverflow.com/questions/15458876/check-if-a-string-is-html-or-not). Try to get that, but in JS. – Alex Munoz Dec 29 '16 at 03:08
  • Only allow a certain subset of tags; remove all other tags.
    – royhowie Dec 29 '16 at 03:23
  • What you are trying to do is called "sanitizing". Please google for that. You will find lots of libraries etc., that you can either use as is, or borrow from. –  Dec 29 '16 at 03:24

2 Answers


Use an HTML Parser

Your requirements are straightforward:

  • You must disallow all <script> tags, but keep certain rich HTML tags.
  • You must be able to neutralise inline JavaScript in links, i.e. escape it or strip the unsafe attributes altogether.

The correct way to handle both is to employ a modern, standards-compliant HTML parser that can syntactically analyse the structure of the submitted rich HTML, identifying each tag and exposing the raw values of its attributes. This is, in fact, how sanitisation, which one of the comments mentions, is done.
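To make the parser-based approach concrete, here is a minimal allowlist sketch in Python, using only the standard library's html.parser. The allowed tags, attributes and URL schemes, and the names AllowlistSanitizer, is_safe_url and sanitize, are assumptions chosen for illustration and do not come from any particular library. For real use, a maintained sanitisation library is likely the safer choice.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical policy for illustration: what a forum might choose to allow.
ALLOWED_TAGS = {"p", "div", "b", "i", "em", "strong", "a", "ul", "ol", "li", "br"}
ALLOWED_ATTRS = {"a": {"href"}}
ALLOWED_SCHEMES = {"http", "https", "mailto"}
VOID_TAGS = {"br"}
# Elements whose text content is dropped together with the tag itself.
DROP_CONTENT = {"script", "style"}


def is_safe_url(url):
    # Browsers ignore tabs and newlines inside URLs, so drop control
    # characters and spaces before looking for a scheme.
    cleaned = "".join(ch for ch in url if ord(ch) > 0x20)
    scheme, colon, _ = cleaned.partition(":")
    if not colon or any(ch in scheme for ch in "/?#"):
        return True  # no scheme present: treat as a relative URL
    return scheme.lower() in ALLOWED_SCHEMES


class AllowlistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return  # unknown tags are dropped, their text is kept
        kept = []
        for name, value in attrs:
            if value is None or name not in ALLOWED_ATTRS.get(tag, set()):
                continue  # drops onclick, style, and anything not listed
            if name == "href" and not is_safe_url(value):
                continue  # drops javascript: and other unlisted schemes
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if not self.skip_depth and tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # Text is re-escaped so it can never be read back as markup.
            self.out.append(escape(data, quote=False))


def sanitize(html_text):
    parser = AllowlistSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


print(sanitize("<p>hi</p><script>alert('1')</script>"))
print(sanitize('<a href="javascript:dangerousFunction();">click me</a>'))
```

The design choice here is an allowlist rather than a blocklist: anything not explicitly permitted is dropped, so a new or unexpected tag or attribute fails closed instead of slipping through.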

There are a number of existing HTML parsers designed to target XSS-unsafe input. The npm library js-xss, for example, appears to be able to do exactly what you want. It can even be run server-side as a command-line utility.

Similar libraries already exist for most languages, so do a thorough search of your preferred language's package repository. Alternatively, you can launch js-xss as a subprocess from the command line and collect its output.

Avoid naively using regular expressions to parse HTML. While it is true that most HTML parsers end up using regular expressions under the hood, they do so in a fairly limited fashion, for strictly well-defined grammars, after correctly lexing the input.
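As a small illustration of why the naive approach tends to fail, the Python sketch below applies a simple script-stripping regex to two inputs. The pattern and the name naive_strip are hypothetical, chosen only to show the failure mode, and are not taken from any library.

```python
import re

# Hypothetical naive filter: delete anything that looks like a <script> element.
NAIVE_SCRIPT_RE = re.compile(r"<script.*?>.*?</script>", re.IGNORECASE | re.DOTALL)


def naive_strip(html_text):
    return NAIVE_SCRIPT_RE.sub("", html_text)


# Removing the inner, empty script elements glues the outer fragments together.
nested = "<scr<script></script>ipt>alert(1)</scr<script></script>ipt>"
# There is no <script> tag here at all, so the filter has nothing to match.
link = '<a href="javascript:dangerousFunction();">click me</a>'

print(naive_strip(nested))  # <script>alert(1)</script>
print(naive_strip(link))    # printed unchanged
```

In the first case the filter itself produces a working script element, and in the second the dangerous link passes through untouched.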

Akshat Mahajan

Use this regex

<script([^'"]|"(\\.|[^"\\])*"|'(\\.|[^'\\])*')*?<\/script>

to detect <script> tags.

However, I suggest displaying all user-submitted HTML inside an iframe with the sandbox attribute. That prevents any JavaScript in it from doing anything harmful to the rest of the page.

http://www.w3schools.com/tags/att_iframe_sandbox.asp
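A minimal sketch of the sandboxed-iframe idea, written in Python as server-side templating. It assumes the untrusted HTML is embedded through the iframe's srcdoc attribute, and sandboxed_iframe is a hypothetical helper name. An empty sandbox attribute applies all restrictions, including blocking script execution.

```python
from html import escape


def sandboxed_iframe(untrusted_html):
    # srcdoc carries the whole document as an attribute value, so it must be
    # attribute-escaped; the browser decodes it again inside the sandboxed frame.
    srcdoc = escape(untrusted_html, quote=True)
    # An empty sandbox attribute enables every restriction. Adding tokens such
    # as allow-scripts together with allow-same-origin would undo the protection.
    return f'<iframe sandbox srcdoc="{srcdoc}"></iframe>'


print(sandboxed_iframe("<p>hello</p><script>alert('1')</script>"))
```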

I hope this helps!

herohamp