How to safely extract text content from arbitrary HTML

Question

I have some user-generated HTML that I don't control.

I want to extract just the text (`textContent`, `innerText`, whatever) from this HTML chunk to display on a website.

How can I safely grab the text, considering that this HTML could contain malicious code such as script tags, iframes, style tags, or similar?

This is an input example:

<p style="text-align:center;"><em>whatever</em></p>
<style>body { display: none } </style>

<p><em>Some more whatever</em></p>
<script>alert('lala')</script>

And this is what I'm expecting back:

whatever

Some more whatever

From what I understand, the solution should not append anything to the DOM, as that could increase the chances of an XSS attack. Using a whitelist/blacklist is fine but not ideal, because it's hard to come up with and to keep updated.

Parse the string and make sure it doesn't have unwanted stuff in there, like `script` tags, and such, and if it does, reject it with an error. — Ryan Wilson, Feb 01 '19 at 16:50
Don't trust the client with this. Let your server handle it. — grooveplex, Feb 01 '19 at 16:52
Why is the example code at the question not an option? Prospective "malicious code" would be a string. — guest271314, Feb 01 '19 at 16:53
Why is the provided snippet an option? Scripts are not executed until the newly created div is appended to the DOM. — Teemu, Feb 01 '19 at 16:53
@enapupe Why is parsing not an option? Obviously you are able to change or modify the example code snippet. Just add that into what you have. — Ryan Wilson, Feb 01 '19 at 16:54
@Teemu interesting, so if I don't append the div to DOM the script will be safely ignored? — enapupe, Feb 01 '19 at 16:54
@enapupe Why would the extracted text be appended to the DOM? — guest271314, Feb 01 '19 at 16:55
@enapupe There's nothing complex about `arbitraryHTML.indexOf(' — Ryan Wilson, Feb 01 '19 at 16:57
Yes, although you have to remove scripts from the div, since the actual script content is included in the `textContent`. — Teemu, Feb 01 '19 at 16:58
The problem is, it's not JUST about scripts, there could be iframes and other stuff on which you can't really be on top of. — enapupe, Feb 01 '19 at 16:59
@enapupe Same concept different tag, `arbitraryHTML.indexOf('') !== -1)`, or you could make a `regular expression` and see if any matches occur. — Ryan Wilson, Feb 01 '19 at 17:00
It doesn't matter; none of the tags are put in the DOM until you actually append the temporary div to it. — Teemu, Feb 01 '19 at 17:00
@Teemu He doesn't want to append it to the DOM if it has scripts or iframes, etc....Hence my suggestion to parse the string first before appending. — Ryan Wilson, Feb 01 '19 at 17:01
@RyanWilson As far as I can see, they don't want to append it at all, they just want to get the text content ..? — Teemu, Feb 01 '19 at 17:02
@Teemu Sorry, you may be right, I just don't understand the purpose of creating a `div` if you aren't going to append it. — Ryan Wilson, Feb 01 '19 at 17:03
@RyanWilson That's just the idea of temporary elements. They are safe to use, since they are not parsed into the DOM, but you can use the built-in HTML parser to get content from such a temporary element. — Teemu, Feb 01 '19 at 17:04
The last paragraph at the edited question has no bearing on the actual question. How is appending anything to the DOM relevant to the original question? What do you mean by _"to display on a website"_? In a `` element? — guest271314, Feb 01 '19 at 17:23
@Teemu I removed my answer, as I don't think it sounds viable with all of the different encodings you'd need to account for. Thanks for the insight, Teemu :) — Ryan Wilson, Feb 01 '19 at 17:30

score 1 · Answer 1 · answered Feb 01 '19 at 17:05

You can use the `*:not()` selector to select all elements except `script` elements:

const arbitraryHTML = `<p style="text-align:center;"><em>whatever</em></p>

<p><em>Some more whatever</em></p>
<script>alert('lala')<\/script>`;

function getTextFromHTML(arbitraryHTML){
  var a = document.createElement('div')
  a.innerHTML = arbitraryHTML;
  // exclude `script` elements at selector string
  return [...a.querySelectorAll('*:not(script)')]
         // filter nodes that do not have `firstElementChild`
         .filter(({firstElementChild})=> !firstElementChild)
         // return `textContent`
         .map(({textContent}) => textContent)
}

console.log(getTextFromHTML(arbitraryHTML))
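
A possible variation on the approach above, offered as a sketch rather than as part of the original answer: instead of filtering with a selector, walk the parsed tree and keep only text nodes, skipping the subtrees of `script` and `style`. The names `collectText`, `extractText` and `SKIPPED` are made up for this illustration. The sketch also assumes `DOMParser` is used for parsing. As far as I know, a document produced by `parseFromString` is separate from the page, so its scripts are not run and its images are not fetched, whereas setting `innerHTML` on a `div` created from the live document can still fire handlers such as an `img` `onerror`, even if the `div` is never appended.

```javascript
// Node type constants as defined by the DOM (Node.ELEMENT_NODE, Node.TEXT_NODE).
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Assumption: these are the only elements whose text should be dropped.
// Extend the set if other elements turn out to matter for your input.
const SKIPPED = new Set(['SCRIPT', 'STYLE']);

// Hypothetical helper: return the trimmed, non-empty text node values
// found under `node`, ignoring everything inside skipped elements.
function collectText(node, skip = SKIPPED) {
  if (node.nodeType === TEXT_NODE) {
    const text = node.nodeValue.trim();
    return text ? [text] : [];
  }
  if (node.nodeType === ELEMENT_NODE && skip.has(node.nodeName.toUpperCase())) {
    return [];
  }
  const parts = [];
  for (const child of node.childNodes || []) {
    parts.push(...collectText(child, skip));
  }
  return parts;
}

// Browser-only entry point (DOMParser is not available in plain Node.js).
// The HTML is parsed into its own document and never touches the page DOM.
function extractText(html) {
  const doc = new DOMParser().parseFromString(html, 'text/html');
  return collectText(doc.body);
}
```

In a browser, `extractText` applied to the input from the question should yield the text of the two paragraphs and nothing from the `style` or `script` elements. When displaying the result, assigning the joined string to an element's `textContent` rather than `innerHTML` keeps it plain text.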

answered Feb 01 '19 at 17:05

guest271314

Looks promising! Trying to think of ways to exploit this approach. – enapupe Feb 01 '19 at 17:11
@enapupe What do you mean by "exploit"? If `script` elements are not intended to be included in the resulting `NodeList` they can be excluded using `:not()` pseudo class selector. What are you ultimately trying to determine and achieve? – guest271314 Feb 01 '19 at 17:14
Since this HTML will come from the user and I don't have control over it, I want to be sure no one can exploit it in any way. The most obvious way is adding a script tag, but you could also do malicious stuff by adding an img tag with a specific src or maybe an iframe. – enapupe Feb 01 '19 at 17:17
Also, I'd prefer a solution in which I don't have to keep adding elements to the exclusion list. As I mentioned in the previous comment, the `style` element is another one that could be found in the arbitrary HTML and would mess up the text content (I think) – enapupe Feb 01 '19 at 17:18
@enapupe None of those extraneous concerns are described at the actual question. If the requirement is to get the `.textContent` of the elements excluding ` – guest271314 Feb 01 '19 at 17:19
Ok I'll edit my question but I should be clear that I want LEGIT text content, nothing else and the html input could be anything. – enapupe Feb 01 '19 at 17:20
@enapupe That is what the code at the answer achieves, without addressing what is meant by _"LEGIT"_. What you _do_ or _do not do_ with the resulting _strings_ is not relevant to the inquiry. – guest271314 Feb 01 '19 at 17:21
Your code is really fine. The only con, as I mentioned before, is the need to keep the blacklist updated and complete. For instance, if you add a style element to the HTML input, it will fail the goal. – enapupe Feb 01 '19 at 17:24
@enapupe You are ex post facto adding requirements to the original question. You can exclude any elements from the selector. If the goal is _"to display on a website."_ the `.textContent` of elements, for example, at a `<textarea>` `.value`, it does not matter if `<script>` and `<style>` elements are included or excluded from the text to be displayed. – guest271314 Feb 01 '19 at 17:26
I'm editing the question as I become aware of new constraints/limitations. I'm trying to get a really good solution out of this question, not just an immediate one. Did you get what I'm going for? Otherwise I can try to improve the question. – enapupe Feb 01 '19 at 17:30
@enapupe If the actual goal is _"to display on a website."_ you can use a `<textarea>` element, where removing elements from an HTML string would be moot, as the `.value` of a [`<textarea>`](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/textarea) element is a string _"The HTML <textarea> element represents a multi-line plain-text editing control"_, not executable code. – guest271314 Feb 01 '19 at 17:32
Not sure how that textarea thing would work. What I can tell is that your code looks **really good** and adding some extra tags to the blacklist could definitely do it. I'm still open to new approaches though, as I would PREFER NOT TO maintain such a blacklist. – enapupe Feb 01 '19 at 17:36
@enapupe _"Not sure how that textarea thing would work."_ ? The edited question states that the ultimate goal is to extract `.textContent` of an HTML string _"to display on a website."_. There is no need for a "blacklist" if that is the case. A `<textarea>` `.value` is plain text, not executable code. – guest271314 Feb 01 '19 at 17:37
I don't get how a textarea would filter HTML markup and get me just the text content. Send me an answer if you think it's an appropriate solution. – enapupe Feb 01 '19 at 17:39
@enapupe The code at the answer provides a template to filter elements which are to be excluded from the result and returns only the `.textContent` of the elements that do not have child elements. The original question only mentioned `script` elements. If you want to add other elements which should be excluded you are free to do so. Is the actual requirement something other than _"to display on a website."_? – guest271314 Feb 01 '19 at 17:41
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/187756/discussion-between-enapupe-and-guest271314). – enapupe Feb 01 '19 at 17:42

score 0 · Answer 2 · answered Feb 01 '19 at 16:59

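If you use the `innerText` property instead of `textContent`, the content of any `<script>` tags is not returned, provided the element is being rendered. For an element that is not rendered, such as a detached `div`, `innerText` returns the same value as `textContent`.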

answered Feb 01 '19 at 16:59

Dan Nagle

It does return the script content. – enapupe Feb 01 '19 at 17:00
