I'm using a series of regex patterns to remove HTML elements from my code. I need to also remove the style="{stuff}" attributes that are also present in the file.

At the moment I have style.*?, which matches only the word style, however I thought that by adding .*? to the regex it would also match with zero to unlimited characters after the style declaration?

I also have style={0,1}"{0,1}.*?"{0,1} which matches:

style=""
style="
style

But does not match style="something", again in this regex I would expect the .*? to match everything between the first " and the second ", but this is not the case. What do I need to do to change this regex so that it will match with all of the following:

style="font-family:"Open Sans", Arial, sans-serif;background-color:rgb(255, 255, 255);display:inline !important;"
style=""
style="something" 
style
Racil Hilan
Jake12342134
  • If what you're trying to match ends at the end of the line, you may use `style.*` or `style.*?$`. If not, you may use something like `style(?:=".*?")?`. – 41686d6564 stands w. Palestine Sep 18 '19 at 12:05
  • Is this special HTML? Because for valid HTML, the `style` attribute must have a value, so it cannot be without an equal sign `style` or with empty quotes `style=""`. Also the quotes can be double or single. – Racil Hilan Sep 18 '19 at 12:07
  • @AhmedAbdelhameed I want it to stop after the second quote, unfortunately your regexes continue to infinity. – Jake12342134 Sep 18 '19 at 12:49
  • Just matching the word "style" itself does seem pretty dangerous... in fact I'd enforce that it be found inside <> tags, personally. – Nyerguds Sep 18 '19 at 12:50
  • https://blog.codinghorror.com/parsing-html-the-cthulhu-way/ – Jan Christoph Uhde Sep 22 '19 at 11:48

2 Answers

The pattern style.*? matches only the word style because nothing follows the non-greedy part, so .*? matches as few characters as possible, which is zero.
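As a minimal sketch of that behaviour, assuming Python's `re` module as the engine (the question does not say which engine is in use), the snippet below runs both patterns from the question against `style="something"`. The variable names are made up for this illustration.

```python
import re

text = 'style="something"'

# A lazy quantifier at the very end of a pattern expands as little as it can;
# with nothing after it to satisfy, that is zero characters.
lazy = re.search(r'style.*?', text).group()

# The second pattern from the question: every part after "style" is optional,
# so the engine takes = and the first ", lets .*? match nothing, and then
# skips the optional closing " because the next character is not a quote.
optional_parts = re.search(r'style={0,1}"{0,1}.*?"{0,1}', text).group()

# Once something mandatory follows the lazy part, it has to grow to reach it.
anchored = re.search(r'style.*?"$', text).group()

print(repr(lazy))
print(repr(optional_parts))
print(repr(anchored))
```

The third pattern is only there to show the contrast: the lazy part grows as soon as the rest of the pattern requires it.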

You could use an optional group and a negated character class:

\bstyle(?:="[^"]*")?

In parts

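  • \bstyle Word boundary, then match style
  • (?: Non-capturing group
  • ="[^"]*" Match =", then any characters except " (the negated character class), then the closing "
  • )? Close the group and make it optional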
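A quick way to check the pattern is to run it over the simpler inputs from the question. This is a sketch that assumes Python's `re` module; the names `STYLE_ATTR` and `samples` are invented for the example.

```python
import re

# The pattern from the answer: "style", optionally followed by ="...".
STYLE_ATTR = re.compile(r'\bstyle(?:="[^"]*")?')

samples = [
    'style="display:inline !important;"',
    'style=""',
    'style="something"',
    'style',
]

# Each sample should be matched in full.
matches = [STYLE_ATTR.search(s).group() for s in samples]

# Removing the attribute from a tag leaves the preceding space behind.
stripped = STYLE_ATTR.sub('', '<p style="color:red">x</p>')

# The bare word in ordinary text is matched as well.
false_positive = STYLE_ATTR.search('I go out in style!').group()

print(matches)
print(stripped)
print(false_positive)
```

Because `[^"]*` stops at the first double quote, a value that itself contains an unescaped `"` would be cut short at that quote.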

If you want to match single or double quotes with the accompanying closing single or double quote to not match for example style="', you could use a capturing group (["']) with a backreference \1 to what was captured in group 1:

\bstyle(?:=(["'])(?:(?!\1).)*\1)?
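The backreference idea can be sketched in Python as follows. As an assumption of this example, the value is matched with `(?:(?!\1).)*`, meaning any character that is not the quote that opened the value, so a single-quoted value stops at its own closing quote and may contain the other quote type. `QUOTED_STYLE` is a hypothetical name.

```python
import re

# Group 1 captures the opening quote; \1 requires the same quote to close.
# (?:(?!\1).)* consumes characters until that same quote appears again.
QUOTED_STYLE = re.compile(r'''\bstyle(?:=(["'])(?:(?!\1).)*\1)?''')

# Single-quoted value: stops at the first closing single quote.
single = QUOTED_STYLE.search("style='a' class='b'").group()

# Double-quoted value containing an apostrophe.
mixed = QUOTED_STYLE.search('style="it\'s"').group()

# Mismatched quotes: the optional group fails, so only the word matches.
mismatched = QUOTED_STYLE.search('style="\'').group()

print(single)
print(mixed)
print(mismatched)
```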

The fourth bird
  • So your first regex is close to what I need, except if there is no space between the final `"` and the closing `>` then it doesn't match. I need to make sure both `` and `` can match – Jake12342134 Sep 18 '19 at 13:04
  • It'll also match the word in the plain sentence `I go out in style!`, though... chance of damaging your text content. – Nyerguds Sep 18 '19 at 13:32
  • @Nyerguds That is correct, if you really want a good approach you should use a parser. Note that your answer is also prone to side effects. See [demo](http://regexstorm.net/tester?p=%28%3f%3c%3d%3c%5ba-zA-Z%5d%5b%5e%3c%3e%5d*%3f%29%5cs*%5cbstyle%28%3f%3a%3d%22%5b%5e%22%5d*%22%29%3f%28%3f%3d%5b%5cs%3e%5d%29%28%3f%3d%5b%5e%3c%3e%5d*%3e%29&i=%3cspan+id%3d%22myid%22+style%3d%22color%3a+red%3b%22+data-test%3d%22%3cspan%3eHi%3c%2fspan%3e%22%3eText%3c%2fspan%3e) – The fourth bird Sep 18 '19 at 13:47
  • I don't see how it failing on _actually broken HTML_ would be undesirable behaviour. – Nyerguds Sep 18 '19 at 13:50
  • @Nyerguds With side effect I mean that your pattern can also fail if you don't expect it to. It is html stored inside a data attribute as an example. Perhaps this page will be helpful https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – The fourth bird Sep 18 '19 at 14:00
  • As I said, that's plain illegal in html. It should at least be html-encoded to `data-test="&lt;span&gt;Hi&lt;/span&gt;"`. And I'm well aware of the Parable of Tony The Pony. – Nyerguds Sep 18 '19 at 14:02

Here's what I cooked up. It uses positive lookbehind (?<=...) and lookahead (?=...) to ensure that the found match is inside an HTML tag:

(?<=<[a-zA-Z][^<>]*?)\sstyle(?:="[^"]*")?(?=[\s>])(?=[^<>]*>)

It will match any whitespace before the "style", so that a removal of all matches goes from <a stuff="..." style="width:18px;" href="someurl"> to <a stuff="..." href="someurl"> without leaving a double space behind where it was removed.

Note that some regex engines (Python's, for example) do not support variable-length lookbehind. This can be solved by changing the first and last parts, the lookbehind group and the final lookahead group, into capture groups instead, so that the match covers the whole HTML tag. Then you replace the match with $1$2 instead of an empty string, which puts back the same tag without the style="..." part inside it.

The resulting regex for that would be:

(<[a-zA-Z][^<>]*?)\sstyle(?:="[^"]*")?(?=[\s>])([^<>]*>)
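A minimal sketch of the capture-group variant in Python, where the replacement is written `\1\2` rather than `$1$2`. It assumes the standard `re` module; `TAG_STYLE`, `html`, and `cleaned` are names made up for this example.

```python
import re

# Python's re module rejects the variable-length lookbehind version outright.
try:
    re.compile(r'(?<=<[a-zA-Z][^<>]*?)\sstyle')
    lookbehind_supported = True
except re.error:
    lookbehind_supported = False

# Capture-group version: group 1 is the tag up to the attribute,
# group 2 is the rest of the tag up to and including the closing >.
TAG_STYLE = re.compile(
    r'(<[a-zA-Z][^<>]*?)\sstyle(?:="[^"]*")?(?=[\s>])([^<>]*>)'
)

html = '<a stuff="..." style="width:18px;" href="someurl">'

# Put both captured halves back, dropping the attribute between them.
cleaned = TAG_STYLE.sub(r'\1\2', html)

# Text outside a tag is left alone.
plain = TAG_STYLE.sub(r'\1\2', 'I go out in style!')

print(lookbehind_supported)
print(cleaned)
print(plain)
```

Since each match consumes the whole tag, a single pass removes at most one style attribute per tag; a tag carrying more than one would need the substitution to be repeated.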

Nyerguds
  • you get some unexpected results if there's a `>` in an attribute, e.g., `` – BurnsBA Sep 18 '19 at 14:05
  • @BurnsBA Such data is illegal inside a tag. It should be encoded to `&gt;`. – Nyerguds Sep 19 '19 at 21:20
  • Also note, this answer (1) incorrectly strips out tags like `data-style`, (2) corrupts non-html like `<a else="" if="" style="example&gt;&lt;/xmp&gt;` and (3) potentially breaks javascript, in situations like `if (a&lt;b) { $(" xyz="">c) { ... }`</a> – BurnsBA Sep 20 '19 at 14:28
  • @BurnsBA What's your point? So do the other solutions suggested here. This one at least has _some_ additional safeguards. I changed the \b to \s to remove the possibility of matching properties like "data-style". – Nyerguds Sep 20 '19 at 17:14