Remove all javascript from page

Question

I have a web page with a control that renders users' HTML markup. I want to remove all JS calls (and CSS, I guess) to prevent users from injecting malicious code. Replacing all script tags and all onclick and other handlers seems to be a bad idea, so the question is about the best solution for this XSS problem in the .NET world.

Injection issues are most common when using forms that affect a database. If your web page doesn't communicate with your database, what are the risks of injection? – Anwar, Jun 29 '15 at 14:19
possible duplicate of [How do I filter all HTML tags except a certain whitelist?](http://stackoverflow.com/questions/307013/how-do-i-filter-all-html-tags-except-a-certain-whitelist) – David Arno, Jun 29 '15 at 14:22
@Zeratops, see http://stackoverflow.com/questions/2779926/is-it-possible-to-make-xss-attacks-through-html-comments-with-jsp-code-inside, or search on cross-site scripting in general, for some details of the risks. – David Arno, Jun 29 '15 at 14:24
@DavidArno and how can it help me? I already filter some tags with `@"(?!<\s*/?\s*(b|i|u|s|strong|em|strike|del|sup|sub|br\s*/?|a|a\shref=""[^""]+"")\s*>)<[^>]+>";`, but I don't understand how it can be applied here. Replace all known JavaScript handlers? Of course I can do that, and I wrote so in the original post, but I guessed that maybe some built-in method exists, like `Jsoup.clean` in Java – Alex Zhukovskiy, Jun 29 '15 at 14:28
@AlexZhukovskiy, let's assume you want to allow `bold text` but want to stop `bold text`, then you can do a regex replace of `` with ``, i.e. just strip all the parameters out of the tags. – David Arno, Jun 29 '15 at 14:32

PhonicUK · Accepted Answer · 2015-06-29T15:19:01.810


I'd strongly suggest not going down the regex route (you can't parse HTML with regex), and would consider something like HtmlAgilityPack instead.

This would allow you to remove all script elements, as well as remove all event handlers from elements regardless of how they're set up.
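As a rough illustration of that idea, here is a minimal sketch that assumes the HtmlAgilityPack NuGet package is referenced. The class and method names (`HtmlScrubber`, `StripScripts`) are made up for this example, and treating every attribute whose name starts with "on" as an event handler is an assumption, not something the library does for you.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // third-party NuGet package, assumed to be installed

public static class HtmlScrubber // hypothetical helper name
{
    public static string StripScripts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Snapshot the nodes first so the tree is not modified while it is enumerated.
        foreach (var node in doc.DocumentNode.Descendants().ToList())
        {
            if (node.NodeType != HtmlNodeType.Element)
                continue;

            // Drop <script> elements entirely.
            if (node.Name == "script")
            {
                node.Remove();
                continue;
            }

            // Assumption: any attribute starting with "on" (onclick, onload, ...) is an event handler.
            foreach (var attr in node.Attributes.ToList())
            {
                if (attr.Name.StartsWith("on", StringComparison.OrdinalIgnoreCase))
                    attr.Remove();
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

This only covers the two things mentioned above. Other vectors, such as `javascript:` URLs in `href` or `src`, `style` attributes, or embedded frames, would still need their own handling, so it should not be read as a complete sanitizer.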

The alternative is to escape all HTML input, and then manually parse the particular tags you're interested in.

<b>Hello</b>

Becomes

&lt;b&gt;Hello&lt;/b&gt;

You can then match &lt;(b|i|u|p|em|othertagsgohere)&gt;(.+?)&lt;/\1&gt; in the escaped text, so that it only matches tags of the types you're interested in that have no attributes on them, and convert just those matches back into real tags. But ultimately I think the HtmlAgilityPack route is the better one.
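The escape-then-match alternative can be sketched with the standard library alone. This is only an illustration under a few assumptions: the names `WhitelistFormatter` and `Render` are invented here, the tag list is a placeholder, and the pattern is written against the escaped form of the tags (with `\1` as the backreference inside the pattern), since the input has already been encoded by the time it is matched.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class WhitelistFormatter // hypothetical helper name
{
    // Matches an escaped, attribute-free tag pair such as &lt;b&gt;...&lt;/b&gt;.
    // The tag list is a placeholder; extend it with the tags you want to allow.
    private static readonly Regex AllowedPair = new Regex(
        @"&lt;(b|i|u|p|em)&gt;(.+?)&lt;/\1&gt;",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static string Render(string userInput)
    {
        // Step 1: escape everything, so no user-supplied markup survives as markup.
        string result = WebUtility.HtmlEncode(userInput);

        // Step 2: turn allowed pairs back into real tags.
        // Repeat until nothing changes so that nested pairs are restored as well.
        string previous;
        do
        {
            previous = result;
            result = AllowedPair.Replace(result, "<$1>$2</$1>");
        } while (result != previous);

        return result;
    }
}
```

A tag that carries any attribute, for example an opening `b` tag with an `onclick`, never matches the pattern and therefore stays escaped and is displayed as literal text. The loop terminates because each replacement removes escaped brackets and never adds any.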

edited Jun 29 '15 at 15:19

answered Jun 29 '15 at 14:36


Unfortunately, I cannot use 3rd-party libs. I disagree with this position and do not want to process tree structures with regex, but I have no choice. So forgive me, please. – Alex Zhukovskiy Jun 29 '15 at 15:05
Why not? It's an open source library, so it can be vetted if necessary. The other option would be to use Microsoft's own AntiXSS library, but again that's a library. – PhonicUK Jun 29 '15 at 15:16
You know, an extra 500 KB library is an impossibly heavy solution when you can do it with regex. You may say "omg", but I was also offered the option of building a state machine and using IndexOf for the sake of speed, so there is always a pit deeper than the previous one. So never say "It just can't be worse". – Alex Zhukovskiy Jun 29 '15 at 15:29
You can't parse HTML with Regex, at best you can manage a small subset of it. 500KB is *tiny*, unless you're running on some insane embedded machine there's no reason not to have it. – PhonicUK Jun 29 '15 at 15:35
