How to remove a tag and its contents using a regular expression?

Question

$str = 'some text <MY_TAG>tag <em>contents </em></MY_TAG> more text ';

My questions are: How do I retrieve the content "tag <em>contents </em>" that sits between <MY_TAG> and </MY_TAG>?

And

How do I remove <MY_TAG> and its contents from $str?

I am using PHP.

Thank you.

I wonder how many times the following answer is linked in any given day: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Nicole, Mar 04 '10 at 18:22

score 13 · Answer 1 · answered Aug 20 '13 at 17:39

For removal I ended up just using this:

$str = preg_replace('~<MY_TAG(.*?)</MY_TAG>~si', "", $str);

Using ~ instead of / as the delimiter avoids the errors caused by the forward slash in the end tag, which otherwise has to be escaped. Leaving > out of the opening tag allows for attributes or other characters and still matches the tag and all of its contents.

This only works where nesting is not a concern.

The si modifiers mean s = the dot also matches line breaks, i = case insensitive. Do not add U (ungreedy) to this pattern: U inverts the greediness of every quantifier, so .*? becomes greedy again and the match runs from the first <MY_TAG to the last </MY_TAG>.
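A quick way to check how the U modifier interacts with a lazy quantifier is to run the same pattern with and without it on a string that contains the tag twice. This is only a sketch; the sample string and variable names are made up for illustration.

```php
<?php
// Hypothetical sample input containing the tag twice.
$str = 'a <MY_TAG>one</MY_TAG> b <MY_TAG>two</MY_TAG> c';

// Lazy quantifier, no U modifier: each tag is matched and removed on its own.
$lazy = preg_replace('~<MY_TAG(.*?)</MY_TAG>~si', '', $str);

// Same pattern with U: PCRE inverts greediness, so ".*?" behaves greedily
// and the match spans from the first opening tag to the last closing tag.
$inverted = preg_replace('~<MY_TAG(.*?)</MY_TAG>~Usi', '', $str);

var_dump($lazy);     // string(7) "a  b  c"
var_dump($inverted); // string(4) "a  c"
```

If U is wanted, the quantifier should be written as plain .* so that it ends up lazy after the inversion.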

Good job, works fine for span, e.g. $ptitle = preg_replace('~<span(.*?)</span>~Usi', "", $ptitleWithSpan); — Hassan Saeed, Jan 05 '17 at 16:51

score 12 · Accepted Answer · answered Mar 04 '10 at 18:22

If MY_TAG cannot be nested, try this to get the matches:

preg_match_all('/<MY_TAG>(.*?)<\/MY_TAG>/s', $str, $matches)

And to remove them, use preg_replace instead.
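Put together, retrieval and removal with the same pattern might look like the sketch below. The sample string is an assumption based on the question, and the variable names $pattern, $contents and $stripped are illustrative only.

```php
<?php
// Assumed sample input, modelled on the string in the question.
$str = 'some text <MY_TAG>tag <em>contents </em></MY_TAG> more text ';

// s lets "." match line breaks; ".*?" stops at the first closing tag.
$pattern = '/<MY_TAG>(.*?)<\/MY_TAG>/s';

// $matches[0] holds the full matches, $matches[1] the captured contents.
preg_match_all($pattern, $str, $matches);
$contents = $matches[1];

// Replace every match with an empty string to drop the tag and its contents.
$stripped = preg_replace($pattern, '', $str);

print_r($contents);  // Array ( [0] => tag <em>contents </em> )
var_dump($stripped); // string(21) "some text  more text "
```

Note that this pattern requires the opening tag to be exactly <MY_TAG>, so a tag with attributes would not match.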

Gumbo


@user187580: The *s* flag makes the `.` match line breaks. See http://php.net/manual/en/reference.pcre.pattern.modifiers.php – Gumbo Mar 04 '10 at 18:33
You had better set ungreedy with this pattern if you may find this tag in the string more than once. Otherwise you'll find that you convert this string "This is a very important set line" into "This is line" – Don Jan 18 '16 at 18:44
@Don The `?` after the `*` does exactly the same. – Gumbo Jan 18 '16 at 20:29
And I looked right at this answer and did not see the ? modifier, whoops! – Don Jan 19 '16 at 00:48

score 2 · Answer 3 · answered Mar 04 '10 at 23:00

You do not want to use regular expressions for this. A much better solution would be to load your contents into a DOMDocument and work on it using the DOM tree and standard DOM methods:

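$document = new DOMDocument();
$document->loadXML('<root/>');
// Parse the text as an XML fragment; it must be well-formed XML
$fragment = $document->createDocumentFragment();
$fragment->appendXML($myTextWithTags);
$document->documentElement->appendChild($fragment);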

$MY_TAGs = $document->getElementsByTagName('MY_TAG');
foreach($MY_TAGs as $MY_TAG)
{
    $xmlContent = $document->saveXML($MY_TAG);
    /* work on $xmlContent here */

    /* as a further example: */
    $ems = $MY_TAG->getElementsByTagName('em');
    foreach($ems as $em)
    {
        $emphasizedText = $em->nodeValue;
        /* do your operations here */
    }
}
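The same DOM approach can also cover the removal part of the question. The following is a minimal sketch, assuming the input is well-formed XML once wrapped in a root element (the sample string and the <root/> wrapper are illustrative assumptions, and appendXML() fails on malformed markup).

```php
<?php
// Assumed sample input, modelled on the string in the question.
$str = 'some text <MY_TAG>tag <em>contents </em></MY_TAG> more text ';

$document = new DOMDocument();
$document->loadXML('<root/>');
$fragment = $document->createDocumentFragment();
$fragment->appendXML($str);
$document->documentElement->appendChild($fragment);

// getElementsByTagName() returns a live list, so copy it to an array
// before removing nodes to avoid skipping elements while iterating.
$nodes = iterator_to_array($document->getElementsByTagName('MY_TAG'));
foreach ($nodes as $node) {
    $node->parentNode->removeChild($node);
}

// Serialize the children of the wrapper element back into a string.
$result = '';
foreach ($document->documentElement->childNodes as $child) {
    $result .= $document->saveXML($child);
}

var_dump($result); // string(21) "some text  more text "
```

Unlike the regular expression variants, this also copes with nested MY_TAG elements, because the whole subtree is removed with its parent.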

Nicole · Answer 4 · 2010-03-04T18:35:48.837

1

Although the only fully correct way to do this is not to use regular expressions, you can get what you want if you accept it won't handle all special cases:

preg_match('/<em[^>]*?>.*?<\/em>/i', $str, $match);
// Use this only if you aren't worried about nested tags.
// It will handle tags with attributes.
// The slash in the closing tag is escaped because / is the delimiter.

And

$str = preg_replace('/<MY_TAG[^>]*?>.*?<\/MY_TAG>/i', '', $str);

edited Mar 04 '10 at 18:35

answered Mar 04 '10 at 18:29

Nicole


score 1 · Answer 5 · answered Jan 02 '21 at 05:48

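I tested this function; it removes tags together with their contents, including other tags nested inside them. With $invert = FALSE (the default), the tags listed in $tags are kept and all other tags are removed; with $invert = TRUE, only the listed tags are removed. Because the patterns are lazy, an element nested inside another element of the same name is not handled correctly. Found here: https://www.php.net/manual/en/function.strip-tags.php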

<?php
function strip_tags_content($text, $tags = '', $invert = FALSE) {

  preg_match_all('/<(.+?)[\s]*\/?[\s]*>/si', trim($tags), $tags);
  $tags = array_unique($tags[1]);
   
  if(is_array($tags) AND count($tags) > 0) {
    if($invert == FALSE) {
      return preg_replace('@<(?!(?:'. implode('|', $tags) .')\b)(\w+)\b.*?>.*?</\1>@si', '', $text);
    }
    else {
      return preg_replace('@<('. implode('|', $tags) .')\b.*?>.*?</\1>@si', '', $text);
    }
  }
  elseif($invert == FALSE) {
    return preg_replace('@<(\w+)\b.*?>.*?</\1>@si', '', $text);
  }
  return $text;
}

// Sample text:
$text = '<b>sample</b> text with <div>tags</div>';

// Result for:
echo strip_tags_content($text);
// text with

// Result for:
echo strip_tags_content($text, '<b>');
// <b>sample</b> text with

// Result for:
echo strip_tags_content($text, '<b>', TRUE);
// text with <div>tags</div>
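One limitation can be checked with the function's fallback pattern in isolation: an element nested inside another element of the same name. This is a sketch with a made-up input string; the lazy .*? stops at the first closing tag, so the outer element is only partly removed.

```php
<?php
// Hypothetical input: a <div> nested inside another <div>.
$text = '<div>outer <div>inner</div> tail</div> after';

// Same pattern as the last preg_replace() call in strip_tags_content().
$result = preg_replace('@<(\w+)\b.*?>.*?</\1>@si', '', $text);

// The match ends at the first </div>, leaving the outer tail behind.
var_dump($result); // string(17) " tail</div> after"
```

For input where same-name nesting can occur, a DOM-based approach is the safer choice.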
