
I have a rich text area where the user can type something. I am trying to prevent JavaScript injection using the following regex:

return input == null ? null : input.replaceAll("(?i)<script.*?>.*?</script.*?>", "") // case 1
            .replaceAll("(?i)<.*?javascript:.*?>.*?</.*?>", "") // case 2
            .replaceAll("(?i)<.*?\\s+on.*?>.*?</.*?>", ""); // case 3

Here, input is the text from the rich text area.

The problem is case 3. If the user's text contains "on", all the text before "on" gets removed.
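A minimal reproduction of the problem, using a hypothetical class name (Case3Demo) and made-up sample inputs. It seems that the lazy `<.*?` is allowed to run past the first `>`, so the match starts at the first `<` on the line and extends to the closing tag that follows " on":

```java
final class Case3Demo {
    // Same three replacements as in the snippet above.
    static String sanitize(String input) {
        return input == null ? null : input.replaceAll("(?i)<script.*?>.*?</script.*?>", "") // case 1
                .replaceAll("(?i)<.*?javascript:.*?>.*?</.*?>", "") // case 2
                .replaceAll("(?i)<.*?\\s+on.*?>.*?</.*?>", ""); // case 3
    }

    public static void main(String[] args) {
        // Harmless markup whose text happens to contain the word "on".
        String harmless = "<b>Hello</b> come on <i>now</i>";
        System.out.println("[" + sanitize(harmless) + "]");       // prints "[]"

        // Without any tag there is no '<' to start a match, so nothing is removed.
        System.out.println("[" + sanitize("come on over") + "]"); // prints "[come on over]"
    }
}
```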

How can I make the last case stricter so that it avoids this problem?

Salem
user1631306

1 Answer


If you want to remove "on" and everything up to the end of the tag, you can use this: .replaceAll("(?i)(<.*?\\s+)on.*?(>.*?)", "$1$2");

This renders "ACD" as "ACD". Be aware, though, that a ">" character inside the script will break the regex.
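For comparison, here is a rough sketch of a stricter take on case 3. It is not the regex from this answer: it first isolates each opening tag and only then removes attributes that look like event handlers (whitespace, "on" plus letters, "=", and a value). The class name, method name, and both patterns are assumptions made for illustration, and the ">" caveat still applies, because a ">" inside an attribute value cuts the tag short.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OnAttributeStripper {
    // An opening tag: '<', a letter, then anything up to the next '>'.
    // Assumption: no attribute value contains '<' or '>'.
    private static final Pattern TAG = Pattern.compile("<[a-zA-Z][^<>]*>");

    // An event-handler attribute: whitespace, "on" + letters, '=',
    // then a double-quoted, single-quoted, or bare value.
    private static final Pattern ON_ATTR = Pattern.compile(
            "(?i)\\s+on[a-z]+\\s*=\\s*(?:\"[^\"]*\"|'[^']*'|[^\\s>]+)");

    static String stripOnAttributes(String input) {
        if (input == null) {
            return null;
        }
        Matcher tag = TAG.matcher(input);
        StringBuffer out = new StringBuffer();
        while (tag.find()) {
            // Only the text of the tag itself is rewritten; text between tags is untouched.
            String cleaned = ON_ATTR.matcher(tag.group()).replaceAll("");
            tag.appendReplacement(out, Matcher.quoteReplacement(cleaned));
        }
        tag.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: <b>come on over</b>
        System.out.println(stripOnAttributes("<b onclick=\"x()\">come on over</b>"));
    }
}
```

Because the attribute pattern is applied only inside a matched tag, the word "on" in ordinary text is left alone. This is still a regex, not an HTML parser, so it should not be treated as a complete defence.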

EDIT: the moral of my remark is that I would not recommend custom parsing to remove JavaScript code. I suggest you read the answers to the following question: Java: Best way to remove Javascript from HTML, and probably use Jsoup.clean (if that is possible in your environment).

Igor Deruga
  • Jsoup removes the attributes from HTML. Does it also work with just plain text? Example: it doesn't work on "I like this site because teaches me a lot" – user1631306 Jan 04 '17 at 19:27
  • It does accept just text, but it might do some things that you don't want: it removed tag completely (it should not be within the text) and it added a newline when I tried it with . Did you think about escaping the HTML (including JavaScript) instead of removing it? – Igor Deruga Jan 04 '17 at 19:35
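The escaping idea from the last comment could look roughly like the sketch below. It uses only the standard library, and the class and method names (HtmlEscaper, escapeHtml) are hypothetical. Note that escaping turns all markup into visible text, so it only fits if the rich text area's formatting does not need to survive.

```java
final class HtmlEscaper {
    // Replaces the five characters that are significant in HTML with entities,
    // so the browser displays the input instead of interpreting it.
    static String escapeHtml(String input) {
        if (input == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: &lt;script&gt;alert(&#39;x&#39;)&lt;/script&gt; come on over
        System.out.println(escapeHtml("<script>alert('x')</script> come on over"));
    }
}
```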