
I am parsing HTML. I know this should be done with DOM/XPath rather than regex, but in my case it just needs to be fast and simple, without Tidy, so I chose regex.

The task is to replace every style='xxx' attribute with an empty string, except within tables.

This regex for preg_replace catches every double-quoted style="xxx", no matter where it appears:

'/ style="([^"]+)"/s'

The content can look like this:

<!-- more html here -->
<span style='do:smtg'><table class=... > <span style="...">
<table> <div style=""></div></table></span></table>
<!-- more html here -->

or it can contain just simple, non-nested tables. Either way, the regex should skip every style='...' inside a table, including inside nested tables.

Is there a simple syntax for doing this?

SQB
Email

2 Answers


Thou Shalt Not Parse HTML with Regular Expressions!


No, really, you shouldn't.

As evidenced by your example, you can expect nested tables. That means the regex should keep track of the level of nesting, to decide whether or not you're in a table. If you find a way to do this, it will certainly not be "fast and simple".
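To illustrate what "keeping track of the level of nesting" involves, here is a minimal sketch that does the counting in PHP rather than in the regex itself. The function name and the token pattern are assumptions made for this example, not something from the question: a callback sees each `<table`, each `</table>` and each style attribute in document order, and keeps a depth counter.

```php
<?php
// Hypothetical helper: strips style attributes that are not inside any table.
function stripStylesOutsideTables(string $html): string
{
    $depth = 0; // current table nesting level

    // One token per match: table open, table close, or a quoted style attribute.
    $tokens = '~<table\b|</table\s*>|\s+style=([\'"])[^\'"]*\1~i';

    $result = preg_replace_callback(
        $tokens,
        function (array $m) use (&$depth): string {
            $token = strtolower($m[0]);
            if ($token === '<table') {
                $depth++;                      // entering a table
                return $m[0];
            }
            if (strncmp($token, '</table', 7) === 0) {
                $depth = max(0, $depth - 1);   // leaving a table, tolerate stray closers
                return $m[0];
            }
            // Style attribute: keep it inside tables, drop it everywhere else.
            return $depth > 0 ? $m[0] : '';
        },
        $html
    );

    // preg_replace_callback returns null on a PCRE error; fall back to the input.
    return $result ?? $html;
}
```

Calling `stripStylesOutsideTables('<span style="a"><table><td style="b"></td></table></span>')` should leave the style on the `<td>` and remove the one on the `<span>`. Like any regex-based approach, it will also touch text that merely looks like a style attribute, and it is not a substitute for a real parser.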

Community
SQB
  • No, the regex does not have to keep track of nested tables. With the right modifier, it only has to look for my given regex where it is not placed within a table. As I said, I am aware of the pros and cons of using regex for this. Thanks anyway.
    – Email Feb 20 '14 at 10:34
  • @Email Not the point; you're trying to use the wrong tool for the job. – David Ehrmann May 20 '14 at 01:09

Email, resurrecting this question because it had a regex that wasn't mentioned. (Found your question while doing some research for a regex bounty quest.)

With all the disclaimers about using regex to parse HTML, here is a simple way to do it.

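First we need a regex to match tables, nested or not. This does it with simple recursion (in (?s) mode, so the dot also matches newlines):

(?s)<table\b[^>]*>(?:(?!</?table\b).|(?R))*+</table>

Inside a table, each step either consumes one character that does not start a <table or </table tag, or recurses to consume a complete inner table. That way the match always ends at the balancing </table>, and sibling tables are matched one at a time.

Next, we exclude these, and match what we do want. Here is the whole regex:

(?s)(<table\b[^>]*>(?:(?!<\/?table\b).|(?1))*+<\/table>)(*SKIP)(*F)|style=(['"])[^'"]*\2

See the demo

The left side of the alternation matches complete tables, nested or not, then deliberately fails. The table matcher sits in Group 1 and recurses with (?1) rather than (?R), because (?R) would re-enter the whole pattern, including the style alternative. The right side matches your styles, capturing the opening quote to Group 2 so that \2 requires the same closing quote, which allows for different quote styles. We know these are the right styles because they were not matched by the expression on the left.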

With this regex, you can do a simple preg_replace($regex, "", $yourstring);
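A minimal sketch of that call, under a few assumptions of mine: the function name is hypothetical, and the pattern is a stricter variant of the one above in which the table matcher recurses into its own group with `(?1)`, plain characters may not cross a `<table` or `</table` tag, and a leading `\s+` is added so the whitespace in front of the attribute is removed as well.

```php
<?php
// Hypothetical helper: removes style attributes except inside <table>...</table>.
function removeStylesOutsideTables(string $html): string
{
    // Left branch: a balanced (possibly nested) table, matched and then skipped.
    // Right branch: a quoted style attribute plus the whitespace before it.
    $pattern = '~(<table\b[^>]*>(?:(?!</?table\b).|(?1))*+</table>)(*SKIP)(*F)'
             . '|\s+style=([\'"])[^\'"]*\2~is';

    $result = preg_replace($pattern, '', $html);

    // preg_replace returns null on a PCRE error (for example a backtrack or
    // JIT stack limit on very large input); fall back to the untouched string.
    return $result ?? $html;
}
```

On very large documents it is worth checking `preg_last_error()` after the call, since a recursive pattern can run into PCRE limits. A table that is never closed does not match the left branch, so styles inside it are stripped like any others.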

Reference

  1. How to match pattern except in situations s1, s2, s3
  2. How to match a pattern unless...
Community
zx81