I have some user-generated HTML that I don't control. I want to extract just the text (textContent, innerText, whatever) from this HTML chunk to display on a website.
How can I safely grab the text, considering that this HTML could contain malicious content such as script tags, iframes, style tags, or similar?
This is an input example:
<p style="text-align:center;"><em>whatever</em></p>
<style>body { display: none } </style>
<p><em>Some more whatever</em></p>
<script>alert('lala')</script>
And this is what I'm expecting back:
whatever
Some more whatever
From what I understand, the solution should not append anything to the DOM, as that could increase the chances of an XSS attack. Using a whitelist/blacklist is fine but not ideal, because such a list is hard to come up with and to keep updated.
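To make the constraints concrete, here is a minimal sketch of the kind of approach I have in mind, assuming a browser where `DOMParser` is available. It parses the markup into a separate document instead of appending it to the page, and scripts in a document parsed this way are not executed. The names `extractText` and `normalizeLines` are hypothetical helpers, and removing `script` and `style` is only there to keep their source out of the text, not a security whitelist.

```javascript
// Sketch: extract plain text from untrusted HTML without attaching it to the live page.
// Assumes a browser environment where DOMParser is available.

// Collapse raw text into trimmed, non-empty lines joined by "\n".
function normalizeLines(text) {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .join("\n");
}

// extractText is a hypothetical helper name, not a built-in.
function extractText(html) {
  // parseFromString builds a separate document; its scripts are not executed.
  const doc = new DOMParser().parseFromString(html, "text/html");

  // textContent would otherwise include the source of these elements,
  // so drop them before reading the text.
  doc.querySelectorAll("script, style").forEach((el) => el.remove());

  return normalizeLines(doc.body.textContent || "");
}
```

The result would then be displayed with something like `target.textContent = extractText(html)` rather than `innerHTML`, so that whatever comes back is treated as text and never as markup.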