Regexp to remove an entire paragraph based on its content?

Question

Hey guys, I'm a regexp noob. Is it possible with preg_replace to remove an entire paragraph tag?

<p><div class="vidwrapper"> lot of content with other divs etc. </div></p>

The paragraph should only be removed if the div directly inside it has the class vidwrapper.

Is that even possible? Any idea what this regexp would look like? Thank you for your help.

You _can_ do it with regex, but there will always be corner cases you cannot handle (at least, not within "reasonable" bounds). Take the following string that a regex might choke on: `
lot of content with oder divs etc.
`. Trying to fix many of these little corner cases with one or more regexes will most likely result in a hideous, error-prone and hard-to-maintain solution. — Bart Kiers, Feb 19 '11 at 14:12
there will never be another
inside of it! It's always the same structure inside of this div! there are just other divs inside of it and a — matt, Feb 19 '11 at 16:57
shouldn't it be something like this: `$para = "/
(.*?)<\/div><\/p>/smix";` — matt, Feb 19 '11 at 17:14
I could write the regex for you if I could understand what you are trying to do. Can you give some concrete examples of before and after? — , Feb 19 '11 at 17:22
Well, it's rather simple! I have some complicated custom HTML that includes a YouTube player with an swfobject. This complicated HTML is the same in every post of my blog and is always wrapped inside this `
...` in every RSS item. I already found a regexp that strips just the script tag inside of it (the swfobject JavaScript), and that works fine! — matt, Feb 19 '11 at 17:38
However, if I strip out the JavaScript inside, there is absolutely no need to keep the empty wrapper and the
around it inside of my RSS feeds. I simply need to get rid of that! And there will be no need to make any exceptions or difficult special cases in this regex because the ` — matt, Feb 19 '11 at 17:40

score 1 · Answer 1 · edited May 23 '17 at 10:33

It's a bad idea to do this using a regex, unless you know that there will be no paragraph (or anything that might superficially be interpreted as a paragraph) inside of the vidwrapper.

If you can't guarantee that, the regex has to cope with input like the following, which will be very hard:

<p><div class="vidwrapper"> Hello there. <p>Wee.</p> Yoink. </div></p>

<p><div class="vidwrapper"> Hello there. <!-- <p>Wee.</p> --> Yoink. </div></p>

An easier (and more robust) way would probably be to parse the HTML with an HTML parser, and do a search on the DOM tree instead.
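
For the narrow case the asker describes in the comments, where the wrapper markup is always identical, a preg_replace call along the following lines might be enough. This is only a minimal sketch under that assumption: the function name `stripVidwrapperParagraphs` and the sample string are made up for illustration, and the pattern expects exactly `class="vidwrapper"` as the div's only attribute.

```php
<?php
// Hypothetical helper (name invented for this sketch).
// Assumes the markup is always <p><div class="vidwrapper"> ... </div></p>
// and that no "</div></p>" pair occurs earlier inside the wrapper.
function stripVidwrapperParagraphs(string $html): string
{
    // s: "." also matches newlines; i: tag names are matched case-insensitively.
    // ".*?" is lazy, so the match ends at the first "</div>" that is
    // followed (apart from whitespace) by "</p>".
    $pattern = '~<p>\s*<div\s+class="vidwrapper">.*?</div>\s*</p>~si';

    $result = preg_replace($pattern, '', $html);

    // preg_replace returns null on failure (for example when the
    // backtrack limit is hit); fall back to the untouched input then.
    return $result ?? $html;
}

// Made-up sample input resembling one RSS item body.
$feedItem = '<p>Intro text.</p>'
    . '<p><div class="vidwrapper"><div id="player"><div>inner</div></div></div></p>'
    . '<p>Closing text.</p>';

echo stripVidwrapperParagraphs($feedItem), "\n";
```

Nested divs are fine here because the lazy match keeps going until a `</div>` is directly followed by `</p>`. Extra attributes on the div, single-quoted attribute values, or an inner `</div></p>` pair would all defeat the pattern, which is the kind of corner case the comments warn about.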

