1

Hey guys, I'm a regexp noob. Is it possible with preg_replace to remove an entire paragraph tag, including its contents?

<p><div class="vidwrapper"> lot of content with other divs etc. </div></p>

The paragraph should only be removed if the div directly inside it has the class vidwrapper.

Is that even possible? Any idea what this regexp would look like? Thank you for your help.

Andy Lester
matt
  • You _can_ do it with regex, but there will always be corner cases you cannot handle (at least, not within "reasonable" bounds). Take the following string, which a regex might choke on: `

    lot of content with oder divs etc.

    `. Trying to fix many of these little corner cases with one or more regexes will most likely result in a hideous, error-prone and hard-to-maintain solution.

    – Bart Kiers Feb 19 '11 at 14:12
  • there will never be another

    inside of it! It's always the same structure inside of this div! there are just other divs inside of it and a

    – matt Feb 19 '11 at 16:57
  • shouldn't it be something like this: `$para = "/

    (.*?)<\/div><\/p>/smix";`
    – matt Feb 19 '11 at 17:14
  • I could write the regex for you if I could understand what you are trying to do. Can you give some concrete examples of before and after? –  Feb 19 '11 at 17:22
  • Well, it's rather simple! I have some complicated custom HTML that includes a YouTube player with a swfobject. This complicated HTML is the same in every post of my blog and is always wrapped inside this `

    ...` in every RSS item. I already found a regexp that strips just the script tag inside of it (the swfobject JavaScript), and that works fine!
    – matt Feb 19 '11 at 17:38
  • However, if I strip out the JavaScript inside, there is also absolutely no need to have the empty wrapper and the

    around it inside of my RSS feeds. I simply need to get rid of that! And there will be no need to make any exceptions or difficult special cases in this regex because the `

    – matt Feb 19 '11 at 17:40

3 Answers

1

It's a bad idea to do this using a regex, unless you know that there will be no paragraph (or anything that might superficially be interpreted as a paragraph) inside of the vidwrapper.

If you don't, writing a regex for something like this will be very hard:

<p><div class="vidwrapper"> Hello there. <p>Wee.</p> Yoink. </div></p>
<p><div class="vidwrapper"> Hello there. <!-- <p>Wee.</p> --> Yoink. </div></p>

An easier (and more robust) way would probably be to parse the HTML with an HTML parser, and do a search on the DOM tree instead.
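As a rough illustration of the parser-based approach, here is a minimal sketch using PHP's bundled DOM extension. The helper name `stripVidwrapper()` and the `fragment-root` wrapper id are made up for this example. One thing to be aware of: HTML parsers generally close an open `<p>` as soon as a `<div>` starts, so the div may not end up as a child of the paragraph in the parsed tree. The sketch therefore removes the div (or its parent `<p>`, if the parser kept the nesting) plus any empty `<p>` elements directly next to it.

```php
<?php
// Sketch only: stripVidwrapper() is a hypothetical helper, not part of any library.
function stripVidwrapper(string $html): string
{
    $previous = libxml_use_internal_errors(true); // collect parse warnings quietly
    $doc = new DOMDocument();
    // Wrap the fragment in a known container so it can be serialized back out.
    // Non-ASCII input would additionally need a charset hint (not shown here).
    $doc->loadHTML('<div id="fragment-root">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // True for a <p> element that holds nothing but whitespace.
    $isEmptyP = function ($node): bool {
        return $node instanceof DOMElement
            && strtolower($node->nodeName) === 'p'
            && trim($node->textContent) === ''
            && $node->getElementsByTagName('*')->length === 0;
    };

    // Match divs whose class list contains the token "vidwrapper".
    $query = '//div[contains(concat(" ", normalize-space(@class), " "), " vidwrapper ")]';
    foreach (iterator_to_array($xpath->query($query)) as $div) {
        $target = $div;
        $parent = $div->parentNode;
        // If the parser kept the div inside a <p>, remove that paragraph instead.
        if ($parent instanceof DOMElement && strtolower($parent->nodeName) === 'p') {
            $target = $parent;
        }
        // Drop empty paragraphs the parser may have left directly around the target.
        foreach ([$target->previousSibling, $target->nextSibling] as $sibling) {
            if ($isEmptyP($sibling)) {
                $sibling->parentNode->removeChild($sibling);
            }
        }
        $target->parentNode->removeChild($target);
    }

    // Serialize only the children of the wrapper, not the implied <html>/<body>.
    $root = $xpath->query('//div[@id="fragment-root"]')->item(0);
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$out = stripVidwrapper(
    '<p>Keep me.</p><p><div class="vidwrapper"><div>player</div>'
    . '<script>var x = 1;</script></div></p><p>Also keep.</p>'
);
```

Because the match is done on the parsed tree, nested divs, paragraphs or comments inside the wrapper do not affect where the removal ends.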

Sebastian Paaske Tørholm
1

If it's a fixed occurrence, then the following might work:

// preg_replace() needs the subject string as its third argument
$html = preg_replace('#<p>[^<]*<div[^>]+class="vidwrapper"[^>]*>.*?</p>#is', "", $html);
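To make the behaviour of that pattern concrete, here is a small sketch that applies it to two made-up strings (the variable names and sample markup are illustrative only). The first has the fixed structure from the question; the second has a paragraph nested inside the wrapper, which shows why the lazy `.*?</p>` only suits the fixed case.

```php
<?php
$pattern = '#<p>[^<]*<div[^>]+class="vidwrapper"[^>]*>.*?</p>#is';

// Fixed structure: only divs inside the wrapper, so the first </p> is the right one.
$feedItem = '<p>Intro.</p><p><div class="vidwrapper"><div>player</div></div></p><p>Outro.</p>';
$clean = preg_replace($pattern, '', $feedItem);
// $clean is '<p>Intro.</p><p>Outro.</p>'

// Nested paragraph: the lazy match stops at the inner </p> and leaves debris behind.
$nested = '<p><div class="vidwrapper">a <p>b</p> c</div></p>';
$broken = preg_replace($pattern, '', $nested);
// $broken is ' c</div></p>'
```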

For matching nested HTML you would normally need a recursive regex, which is why something like phpQuery or QueryPath is often simpler:

$html = pq($html)->find("p div.vidwrapper")->parent()->remove()->html();
mario
0

If you think the script will cause problems, you can use this as well.

#
 \s*
 <p\s*> \s* <div \s+ class \s* = \s* (["']) vidwrapper \1 \s* >
 (?:
      <script (?:\s+ (?:".*?"|'.*?'|[^>]*?)+)? \s*>
      .*?
      </script\s*>
   |  .
 )*?
 </p\s*>
#xs
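For reference, this is one way the pattern above could be passed to preg_replace. It is a sketch under assumptions: the pattern is restated with its groups balanced, a nowdoc is used so the quotes need no escaping, and the sample input is made up. The point it illustrates is that a `</p>` appearing inside the script block does not end the match early.

```php
<?php
// Nowdoc keeps the free-spacing (x) pattern readable; the leading and trailing
// "#" are the delimiters, followed by the x and s modifiers.
$pattern = <<<'REGEX'
#
 \s*
 <p\s*> \s* <div \s+ class \s* = \s* (["']) vidwrapper \1 \s* >
 (?:
      <script (?:\s+ (?:".*?"|'.*?'|[^>]*?)+)? \s*>
      .*?
      </script\s*>
   |  .
 )*?
 </p\s*>
#xs
REGEX;

// Hypothetical input: the script itself contains the text "</p>".
$input = '<p>Before.</p><p><div class="vidwrapper">'
       . '<script>document.write("</p>");</script></div></p><p>After.</p>';

$result = preg_replace($pattern, '', $input);
// $result is '<p>Before.</p><p>After.</p>'
```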