20

I have the following html:

<html>
 <body>
 bla bla bla bla
  <div id="myDiv"> 
         more text
      <div id="anotherDiv">
           And even more text
      </div>
  </div>

  bla bla bla
 </body>
</html>

I want to remove everything from <div id="anotherDiv"> through its closing </div>. How do I do that?

user229044
rockstardev
  • There seems to be an edit war on this page. Please clarify this Unclear question so that researchers can benefit. – mickmackusa Nov 22 '19 at 22:00
  • There is a big difference between removing a single, specific element versus removing all tags with a specific tagname. – mickmackusa Nov 22 '19 at 22:15
  • Every regex solution to this question is incorrect, for any interpretation of this question, and will fail in surprising ways on many different inputs. You need a DOM parser, as the accepted answer uses. Whether you thought the question wanted to strip a `<div>`, or strip an element by its ID, neither option can be accomplished correctly with a regular expression. – user229044 Nov 28 '19 at 03:12
  • Consider stripping `
    ` (by tag or by ID) from `
    ` with a regex. Or `
    `, or any other number of simple cases that will break a regex-based solution.
    – user229044 Nov 28 '19 at 03:35

7 Answers

34

With native DOM

$dom = new DOMDocument;
$dom->loadHTML($htmlString);
$xPath = new DOMXPath($dom);
$nodes = $xPath->query('//*[@id="anotherDiv"]');
if($nodes->item(0)) {
    $nodes->item(0)->parentNode->removeChild($nodes->item(0));
}
echo $dom->saveHTML();
Gordon
  • What do I have to modify if I want to remove all div tags in a DOM? – Sisir Nov 19 '11 at 08:51
  • @Sisir see http://stackoverflow.com/questions/4177376/delete-all-elements-of-a-certain-type-from-an-xml-doc-using-php/4177407#4177407 – Gordon Nov 19 '11 at 09:10
  • 1
    Yes, this works a treat. I've always wanted to be able to remove an HTML tag from a string of HTML, much like jQuery's $(selector#id).remove(). This is just brilliant! – azzy81 Mar 09 '12 at 07:50
  • @SubstanceD if you want selectors check out [phpQuery, Zend_Dom or QueryPath](http://stackoverflow.com/questions/3577641/how-to-parse-and-process-html-with-php/3577662#3577662). Personally, I prefer [XPath](http://schlitt.info/opensource/blog/0704_xpath.html). – Gordon Mar 09 '12 at 08:43
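Regarding the question in the comments about removing every div rather than one: a minimal sketch, not taken from the answer above, could walk the list returned by getElementsByTagName() backwards. The function name removeAllByTagName is hypothetical, and suppressing libxml warnings is an assumption about how tolerant you want to be of sloppy markup.

```php
<?php
// Hypothetical helper (not part of the original answer):
// removes every element with the given tag name, including its contents.
function removeAllByTagName(string $html, string $tagName): string
{
    $dom = new DOMDocument();

    // Assumption: the input may be imperfect HTML, so collect parser
    // warnings internally instead of emitting them.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $nodes = $dom->getElementsByTagName($tagName);

    // The node list is live, so iterate from the end while removing.
    // Nested matches come later in document order and are removed first.
    for ($i = $nodes->length - 1; $i >= 0; $i--) {
        $node = $nodes->item($i);
        $node->parentNode->removeChild($node);
    }

    return $dom->saveHTML();
}

$sample = '<html><body><p>keep</p><div>drop <div>nested</div></div></body></html>';
echo removeAllByTagName($sample, 'div');
```

Iterating backwards avoids skipping entries as the live list shrinks; copying the list into an array first would work as well.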
14

You can use preg_replace() like:

$string = preg_replace('/<div id="someid"[^>]+\>/i', "", $string);
Florent
Haim Evgi
  • 1
    this will remove all `div`s and not only the specified one. – jigfox Jul 22 '10 at 12:11
  • You don't specify anywhere that it must remove the div with the ID=myDiv? – rockstardev Jul 22 '10 at 12:11
  • @HaimEvgi Is there any way to remove the inner content as well? For example, with p tags the tags are removed, but the content of the p tags remains. – avolquez Nov 29 '12 at 19:29
  • this rocks, but is there anyway to remove the closing tag? – hakazvaka Apr 22 '13 at 15:10
  • Here is a simple way to strip specific tags(both open & closing): https://gist.github.com/tedicela/0b06265eefb8df41cb8256bb3f442916 – Tedi Çela Dec 09 '16 at 14:44
  • 1
    This answer DEFINITELY doesn't do what the OP requires. 16 UVs means that lots of researchers have been misinformed and don't understand the question and/or what this answer does. This answer does far more harm than good. The overarching message should be that developers should use a dom parser to manipulate valid html. – mickmackusa Nov 21 '19 at 21:43
  • 1
    Question says: _I want to remove everything starting from `<div id="anotherDiv">` until its closing `</div>`. How do I do that?_ **This answer is incorrect.** – mickmackusa Nov 21 '19 at 21:46
  • This is incorrect and fails for `
    `. You cannot use a regex for this.
    – user229044 Nov 28 '19 at 03:45
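To make the objections in these comments concrete, here is a small sketch of two ways a regex can go wrong on the question's markup. The first pattern is the one from the answer with the id swapped to anotherDiv; the second is a hypothetical lazy pattern aimed at the outer div, used only to illustrate the nesting problem.

```php
<?php
$html = '<div id="myDiv">more text<div id="anotherDiv">And even more text</div></div>';

// Pattern from the answer, pointed at the target id. `[^>]+` requires at
// least one character between the id attribute and `>`, so this tag does
// not match and the string comes back unchanged.
$openOnly = preg_replace('/<div id="anotherDiv"[^>]+\>/i', '', $html);
var_dump($openOnly === $html); // bool(true)

// Hypothetical lazy pattern for the OUTER div: it stops at the first
// closing tag, which belongs to the inner div, and leaves a stray </div>.
$lazy = preg_replace('#<div id="myDiv">.*?</div>#s', '', $html);
var_dump($lazy); // string(6) "</div>"
```

A regex has no notion of which closing tag belongs to which opening tag, which is why the comments recommend a DOM parser.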
5

Using the native XML Manipulation Library

Assuming that your html content is stored in the variable $html:

$html='<html>
 <body>
 bla bla bla bla
  <div id="myDiv"> 
         more text
      <div id="anotherDiv">
           And even more text
      </div>
  </div>

  bla bla bla
 </body>
</html>';

To delete the tag by ID use the following code:

    $dom = new DOMDocument;
    $dom->validateOnParse = false;
    $dom->loadHTML($html);

    // get the tag
    $div = $dom->getElementById('anotherDiv');

    // delete the tag
    if ($div && $div->nodeType == XML_ELEMENT_NODE) {
        $div->parentNode->removeChild($div);
    }

    echo $dom->saveHTML();

Note that certain versions of libxml require a doctype to be present in order to use the getElementById method.

In that case, you can prepend a doctype to $html:

$html = '<!DOCTYPE html>' . $html;
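If you would rather not depend on how a particular libxml version treats getElementById(), one option is to fall back to an XPath lookup when it returns null. This is only a sketch under that assumption; the function name removeElementById is hypothetical, and the fallback is skipped for ids containing a double quote so the XPath expression stays well formed.

```php
<?php
// Hypothetical helper: remove one element by id, trying getElementById()
// first and falling back to XPath if it finds nothing.
function removeElementById(string $html, string $id): string
{
    $dom = new DOMDocument();

    // Assumption: suppress parser warnings for imperfect HTML.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $node = $dom->getElementById($id);

    // Fallback lookup; only attempted when the id is safe to embed in
    // a double-quoted XPath string literal.
    if ($node === null && strpos($id, '"') === false) {
        $xpath = new DOMXPath($dom);
        $found = $xpath->query('//*[@id="' . $id . '"]');
        if ($found !== false && $found->length > 0) {
            $node = $found->item(0);
        }
    }

    if ($node !== null && $node->parentNode !== null) {
        $node->parentNode->removeChild($node);
    }

    return $dom->saveHTML();
}

$page = '<html><body><div id="myDiv">more text<div id="anotherDiv">And even more text</div></div></body></html>';
echo removeElementById($page, 'anotherDiv');
```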

Alternatively, as suggested by Gordon's answer, you can use DOMXPath to find the element using the xpath:

$dom = new DOMDocument;
$dom->validateOnParse = false;
$dom->loadHTML($html);

$xp = new DOMXPath($dom);
$col = $xp->query('//div[ @id="anotherDiv" ]');

if (!empty($col)) {
    foreach ($col as $node) {
        $node->parentNode->removeChild($node);
    }
}

echo $dom->saveHTML();

The first method works regardless of the tag name. To use the second method with the same id but a different tag, say form, replace //div in //div[ @id="anotherDiv" ] with //form.
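As a sketch of that substitution, the tag name and id can be passed in as parameters. The function name removeByTagAndId is hypothetical, and the validation rules are an assumption chosen to keep the generated XPath expression well formed. The tag is lowercased because loadHTML() produces lowercase element names and XPath matching is case-sensitive.

```php
<?php
// Hypothetical helper: remove elements matching both a tag name and an id.
function removeByTagAndId(string $html, string $tag, string $id): string
{
    // Assumption: accept only plain names so the XPath stays well formed.
    if (!preg_match('/^[A-Za-z][A-Za-z0-9]*$/', $tag)
        || !preg_match('/^[A-Za-z0-9_:.-]+$/', $id)) {
        throw new InvalidArgumentException('Unsupported tag or id');
    }

    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $query = sprintf('//%s[@id="%s"]', strtolower($tag), $id);
    $matches = $xpath->query($query);

    if ($matches !== false) {
        // Copy to an array before removing nodes from the document.
        foreach (iterator_to_array($matches) as $node) {
            $node->parentNode->removeChild($node);
        }
    }

    return $dom->saveHTML();
}

$page = '<html><body><form id="x">f</form><div id="y">d</div></body></html>';
echo removeByTagAndId($page, 'form', 'x');
```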

RafaSashi
0

strip_tags() function is what you are looking for.

http://us.php.net/manual/en/function.strip-tags.php

ItsPronounced
  • 4
    strip_tags() doesn't work the way he wants it to. strip_tags() allows for certain exclusions, but why would you use that when you only want to exclude one tag and include all other tags? – Haim Evgi Jul 22 '10 at 12:02
  • From his question, I couldn't really tell what tags he was trying to remove. It seemed as if he wanted to remove everything. Thanks for the input. – ItsPronounced Jul 22 '10 at 12:03
  • Ahhh, using chrome. His inline markup didn't show up. I just checked it in firefox and I see his inline markup. You are correct :) Any reason why it didn't show up in chrome? – ItsPronounced Jul 22 '10 at 12:06
  • strip_tags() worked best for me. Thanks. The reason it worked best for me is because i had tags that had no spaces. It was the easiest by far. thanks. – Alex Spencer Dec 19 '12 at 02:24
  • Question says: _I want to remove everything starting from `<div id="anotherDiv">` until its closing `</div>`. How do I do that?_ **This answer is incorrect.** – mickmackusa Nov 21 '19 at 21:48
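For reference, a quick sketch of what strip_tags() does with markup like the question's: it drops the tags but keeps the text inside them, so the contents of anotherDiv survive. The one-line $html string here is a condensed version of the question's markup, used only for illustration.

```php
<?php
$html = '<div id="myDiv">more text <div id="anotherDiv">And even more text</div></div>';

// strip_tags() removes markup only; the inner text is left in place.
$text = strip_tags($html);
echo $text; // prints: more text And even more text
```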
-1

I wrote these to strip specific tags and attributes. Since they're regex they're not 100% guaranteed to work in all cases, but it was a fair tradeoff for me:

// Strips only the given tags in the given HTML string.
function strip_tags_blacklist($html, $tags) {
    foreach ($tags as $tag) {
        $regex = '#<\s*' . $tag . '[^>]*>.*?<\s*/\s*'. $tag . '>#msi';
        $html = preg_replace($regex, '', $html);
    }
    return $html;
}

// Strips the given attributes found in the given HTML string.
function strip_attributes($html, $atts) {
    foreach ($atts as $att) {
        $regex = '#\b' . $att . '\b(\s*=\s*[\'"][^\'"]*[\'"])?(?=[^<]*>)#msi';
        $html = preg_replace($regex, '', $html);
    }
    return $html;
}
Aram Kocharyan
  • 1
    Regex is DOM-ignorant and is prone to failure. Using a legitimate DOM parsing technique will be more robust, reliable, and scalable. Iterated `preg_` calls are going to be inefficient. The `m` pattern modifier is of no use. – mickmackusa Nov 21 '19 at 21:50
  • 1
    This answer does not target the tag using the `id` as stated in the question. This answer is incorrect because it will remove elements that should not be removed. – mickmackusa Nov 21 '19 at 22:06
-1

how about this?

// Strips only the given tags in the given HTML string.
function strip_tags_blacklist($html, $tags) {
    $html = preg_replace('/<'. $tags .'\b[^>]*>(.*?)<\/'. $tags .'>/is', "", $html);
    return $html;
}
Community
Hoàng Vũ Tgtt
  • 1
    Regex is DOM-ignorant and is prone to failure. Using a legitimate DOM parsing technique will be more robust, reliable, and scalable. There is no reason to declare `$html` (a single-use variable); just `return preg_replace(...);` This snippet will fail when a tag attribute value contains `>`. There is no need to use a capture group. – mickmackusa Nov 21 '19 at 21:53
  • This answer does not target the tag using the `id` as stated in the question. This answer is incorrect because it will remove elements that should not be removed. – mickmackusa Nov 21 '19 at 22:07
  • This is incorrect and fails for many kinds of input, for example `strip_tags_blacklist('
    foo
    ', 'div')` => `
    – user229044 Nov 28 '19 at 03:49
-1

Following RafaSashi's answer using preg_replace(), here's a version that works for a single tag or an array of tags:

/**
 * @param $str string
 * @param $tags string | array
 * @return string
 */

function strip_specific_tags ($str, $tags) {
  if (!is_array($tags)) { $tags = array($tags); }

  foreach ($tags as $tag) {
    $_str = preg_replace('/<\/' . $tag . '>/i', '', $str);
    if ($_str != $str) {
      $str = preg_replace('/<' . $tag . '[^>]*>/i', '', $_str);
    }
  }
  return $str;
}
  • 1
    Question says: _I want to remove everything starting from `<div id="anotherDiv">` until its closing `</div>`. How do I do that?_ **This answer is incorrect.** – mickmackusa Nov 21 '19 at 21:55
  • 1
    This answer does not target the tag using the `id` as stated in the question. This answer is incorrect because it will remove elements that should not be removed. – mickmackusa Nov 21 '19 at 22:07