5

I am adding Stack Overflow-style images and formatting to posts (as far as the Stack Overflow post-editing tools allow), so I have generated HTML for the formatted post that could be displayed on pages.

The problem is how to display that HTML. When I try to display it, it gets printed on the page as literal text, like "<html>blah blah</html>". How can I display this HTML content safely on my web pages?

HoldOffHunger
Rajat Gupta
  • This page has some answers you're looking for http://stackoverflow.com/q/6234773/3529744 – one stevy boi Apr 18 '14 at 22:50
  • 1
    Do you want to print the HTML source (so it looks like `<h1>Foo</h1>`) or do you want to include the HTML in the page (so that the above would be presented as a heading "Foo")? It's not clear.
    – Adam Apr 18 '14 at 22:50

2 Answers

3

What's happening in your case is that the HTML is being escaped and is therefore rendered as text.

I don't know what language you are writing in, but I suspect you used a built-in text-escaping function. That renders the HTML as text, but it does not give you a safe way to display it as HTML.
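As a minimal sketch of what such an escape function does, here is Python's standard `html.escape` (the asker's language is unknown, so Python and the sample string are assumptions made purely for illustration):

```python
import html

# Hypothetical user-submitted post body.
user_html = "<b>bold</b> <script>alert(1)</script>"

# Escaping replaces markup characters with entities, so the browser shows
# the tags as literal text instead of interpreting them.
escaped = html.escape(user_html)
print(escaped)
```

The result is harmless to show, but the formatting is lost as well, which matches the symptom described in the question.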

I suspect that what you are looking for is a solution that will:

  1. Parse the HTML and sanitize it to remove any potentially malicious tags such as JavaScript, external references, iframes etc.
  2. Store this sanitized HTML.
  3. Render the input as part of the page.
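The sanitizing step can be sketched roughly as follows. This is a toy allowlist sanitizer built only on Python's standard library, not a vetted implementation: the names `ALLOWED_TAGS`, `AllowlistSanitizer` and `sanitize` are hypothetical, the tag list is an arbitrary assumption, and a maintained sanitizing library is the better choice for real use. The idea it shows is to rebuild the output from scratch, emitting only markup that is explicitly allowed and escaping everything else.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Hypothetical allowlist for illustration; a real one needs careful review.
ALLOWED_TAGS = {"p", "b", "i", "em", "strong", "code", "pre", "ul", "ol", "li", "a", "br"}
VOID_TAGS = {"br"}                  # tags that never take a closing tag
DROP_CONTENT = {"script", "style"}  # tags whose inner text is discarded too


class AllowlistSanitizer(HTMLParser):
    """Rebuilds output from scratch, emitting only allowlisted markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # sanitized output fragments
        self.open_tags = []  # allowlisted tags currently open
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in ALLOWED_TAGS:
            self.out.append(self._open_tag(tag, attrs))
            if tag not in VOID_TAGS:
                self.open_tags.append(tag)

    def _open_tag(self, tag, attrs):
        # Every attribute is dropped except an http(s) href on links.
        if tag == "a":
            href = dict(attrs).get("href") or ""
            try:
                scheme = urlsplit(href).scheme
            except ValueError:
                scheme = ""
            if scheme in ("http", "https"):
                return f'<a href="{escape(href, quote=True)}" rel="nofollow">'
        return f"<{tag}>"

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in self.open_tags:
            # Close back to the matching tag; stray closers are ignored.
            while True:
                closed = self.open_tags.pop()
                self.out.append(f"</{closed}>")
                if closed == tag:
                    break

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(escape(data))

    def result(self):
        self.close()  # flush any buffered text
        closers = "".join(f"</{t}>" for t in reversed(self.open_tags))
        return "".join(self.out) + closers


def sanitize(untrusted_html):
    parser = AllowlistSanitizer()
    parser.feed(untrusted_html)
    return parser.result()


dirty = '<p onclick="evil()">Hi <b>there</b></p></div><script>alert(1)</script>'
print(sanitize(dirty))
```

In this sketch the event-handler attribute, the stray `</div>` and the whole script element are dropped, while the allowed formatting survives. Because unknown markup is never copied through, anything the allowlist does not anticipate is removed rather than passed along.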

Stack Exchange supports only a strict subset of HTML; you may want to emulate that approach.

This is not a simple problem to solve, and you will most likely want to find a framework that does this for you rather than rolling your own.

For example, some exploits that someone may want to attempt against your system:

  • Additional </div> tags to escape the wrapping element.
  • Some character combination that may not look like valid HTML but behaves as such anyway.
  • Utilizing some JavaScript that you already have on your page.
  • Adding CSS to break the page layout.
Community
Vasily Sliounaiev
  • Note that it's better to store verbatim user input and sanitize it during use. This allows *fixing* any errors in the sanitizer. Of course, it also requires that you *correctly encode* the incoming data so that you can safely store it verbatim. Note that all encoding requires knowledge of the *context* to encode correctly. For example, incoming data needs different encoding for a column name vs. a string value in SQL, and neither of those encodings is the same as the one needed for, e.g., HTML attribute data. Also beware of data formats that can enclose other formats. For example, HTML->IFRAME->CSS->SVG. – Mikko Rantalainen Jun 28 '21 at 11:14
1

It's a two-step process. First you need to sanitize the input with a library like this one: http://msdn.microsoft.com/en-us/security/aa973814.aspx . It will remove script tags and other sneaky things people could use to do something malicious.

Then you need to display the raw output. With ASP.NET MVC it's @Html.Raw(Model.SomePropertyThatIsHtml). If you're using something else, it should have an equivalent that prevents the output from being encoded.
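For readers outside ASP.NET, the "encode by default, emit raw only on request" pattern can be sketched in plain Python. `TrustedHtml` and `render_value` are hypothetical names invented for this illustration, not part of any framework; the sketch assumes the value was already sanitized before being marked as trusted.

```python
from html import escape


class TrustedHtml(str):
    """Hypothetical marker type: a string that has already been sanitized."""


def render_value(value):
    # Default: encode, like a template engine's normal output expression.
    # Only values explicitly marked as trusted are emitted raw.
    if isinstance(value, TrustedHtml):
        return str(value)
    return escape(str(value))


comment = "<b>nice</b> post"
print(render_value(comment))               # encoded: the tags show up as text
print(render_value(TrustedHtml(comment)))  # raw: the browser renders bold text
```

Keeping the raw path opt-in means a value that was never sanitized is still encoded if someone forgets to mark it.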

Honorable Chow