I'm trying to write a PHP function using preg_replace that removes all inline attributes from HTML elements, except for a few such as 'href', 'title', and 'alt'.

What I have so far is

([\w\-.:]+)\s*=\s*("[^"]*"|'[^']*'|[\w\-.:]+)

to match all inline attributes, but it also matches plain text like

href="test" Test

that has no HTML around it. In addition, it matches every inline attribute, including the ones I want to keep. See my example text here:

https://regex101.com/r/3OVaO2/1

The goal is to remove any dangerous html elements. I know that I have to handle something for the href-attribute in an extra function.

sneaky
  • 1
    I assume, by inline elements, you mean attributes of the tags? – Christoph Herold Mar 12 '19 at 16:46
  • 2
    Regular expressions aren't very appropriate for this task; you need a proper HTML parser. Two things you need to be aware of: [attribute values are not always quoted](https://www.w3.org/TR/html5/syntax.html#unquoted) and they [can contain line breaks](https://stackoverflow.com/q/22831988/6002174). – Tsundoku Mar 12 '19 at 16:50
  • 3
    See also [Using regular expressions to parse HTML: why not?](https://stackoverflow.com/q/590747/6002174) – Tsundoku Mar 12 '19 at 16:51
  • 3
    elements ≠ attributes – Sean Mar 12 '19 at 16:51
  • 1
    You said "wanted to leave some like 'href'" and "The goal is to remove any dangerous html elements" — But `href` is a dangerous attribute and can be used to trigger XSS attacks. – Quentin Mar 12 '19 at 17:09
  • 1
    What are you actually trying to achieve? If you are building a web application to display some content, of which you don't know whether it is safe or not, you can simply display the content in an iframe, on which you can apply the sandbox attribute (https://caniuse.com/#feat=iframe-sandbox). This will remove privileges on the browser level, which is probably always safer than what you can do. – Christoph Herold Mar 12 '19 at 17:22
  • Thanks for all the suggestions. I have added more information and hopefully corrected the question. I know that I have to handle href with something extra. I just want to remove attributes like "onmouseover" and so on. (But with a whitelist; the goal is to keep just href, title and alt.) – sneaky Mar 12 '19 at 20:48

1 Answer

As already mentioned in the comments, regex is not the way to go here.

That said, I have come up with this (https://regex101.com/r/3OVaO2/2):

(<\w+\s*[^>]*)\s(?!href|title|alt)[\w\-\d]+=(?:(['"]).*?\2|\w+)

However, this will only remove ONE evil attribute per tag. The problem is that PCRE does not support variable-length lookbehind assertions. If you switch to the ECMAScript flavor, you can do this (https://regex101.com/r/3OVaO2/3):

(?<=<\w+\s*[^>]*)\s(?!href|title|alt)[\w\-\d]+=(?:(['"]).*?\1|\w+)
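As a minimal sketch of how the lookbehind pattern could be applied in JavaScript: the helper name `stripAttributes`, the global flag, and the empty replacement string are assumptions made for illustration, not something stated above. It needs an engine with lookbehind support (ES2018 or later).

```javascript
// The lookbehind pattern from above, with the "g" flag added (assumption)
// so that every disallowed attribute is replaced in a single pass.
const DISALLOWED_ATTR =
  /(?<=<\w+\s*[^>]*)\s(?!href|title|alt)[\w\-\d]+=(?:(['"]).*?\1|\w+)/g;

// Hypothetical helper: drops every attribute whose name does not start
// with href, title or alt. The lookbehind keeps the tag prefix out of the
// match, so the match can simply be replaced with an empty string.
function stripAttributes(html) {
  return html.replace(DISALLOWED_ATTR, '');
}

const input = `<a href="x" onclick="evil()" title="t" onmouseover='bad()'>link</a>`;
console.log(stripAttributes(input));
// prints: <a href="x" title="t">link</a>
```

Text without a surrounding tag, such as `href="test" onclick="x" Test`, is left untouched because the lookbehind requires a preceding `<` followed by a tag name with no `>` in between.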

This will probably do what you want it to do. Nonetheless, this is NOT the holy grail for sanitizing HTML. Be careful with your output if you don't consider your input safe.

Also, the definition of the tags may need some tweaking, since `\w+` alone does not cover tag names like <some-element>. Such tags are currently only handled because the following `[^>]*` absorbs the rest of the name.
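For engines without variable-length lookbehind, one possible workaround for the one-attribute-per-tag limitation of the first pattern is to repeat the replacement until the string stops changing. The sketch below is in JavaScript purely for illustration; the helper name `stripAttributesLooping` and the `$1` replacement (which puts the captured tag prefix back) are assumptions. The same loop should carry over to PHP's preg_replace, since the pattern uses no lookbehind.

```javascript
// First pattern from the answer: group 1 captures the tag prefix,
// group 2 the opening quote of the attribute value.
const ONE_ATTR_PER_TAG =
  /(<\w+\s*[^>]*)\s(?!href|title|alt)[\w\-\d]+=(?:(['"]).*?\2|\w+)/g;

// Hypothetical helper: each pass removes at most one disallowed attribute
// per tag (the last one, because [^>]* is greedy), so we repeat until a
// pass leaves the string unchanged.
function stripAttributesLooping(html) {
  let previous;
  let current = html;
  do {
    previous = current;
    current = previous.replace(ONE_ATTR_PER_TAG, '$1');
  } while (current !== previous);
  return current;
}

console.log(stripAttributesLooping('<p style="c" id=x>a</p><img alt="a" src=y>'));
// prints: <p>a</p><img alt="a">
```

The loop terminates because every successful replacement shortens the string. Note that `src` is removed as well here, since it is not on the whitelist.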

Christoph Herold
  • Thanks. I know that this won't solve every problem, but I can use the first regex to remove the whole tag, which is enough for me. I found out that browsers still honor attributes when there is whitespace around the equals sign, so I ended up with this regex: `(<\w+\s*[^>]*)\s(?!href|title|alt)[\w\-\d]+\s*=\s*(?:(['"]).*?\2|\w+)` I know that I still have to handle the href attribute (and remove things like "javascript:" from it). I'm thinking of using HTML Purifier later on. – sneaky Mar 13 '19 at 08:39