Remove complete HTML tag when characters are found

Question

A string contains an HTML tag wrapping a word plus a suffix (in this case ...rem).

Example:

<b>SomeText...rem</b>
<u>SomeText...rem</u>
<strong>SomeText...rem</strong>
<a href="/">SomeText...rem</a>
<div>SomeText...rem</div>

When the word inside the HTML tag contains

...rem

the complete HTML tag, including the word, should be removed.

I can rename "...rem". It's only a marker.

Is this possible?

you can use [`:contains()`](https://api.jquery.com/contains-selector/) — Pranav C Balan, Feb 06 '16 at 16:47
Will the structure be nested or flat? i.e. is this valid - blah blah some text ...rem — IaMaCuP, Feb 06 '16 at 16:48
@lamaCuP the structure "can be" nested. When it's nested, remove the complete tag in your example, from the opening tag to the closing tag — labu77, Feb 06 '16 at 16:51
Is "...rem" always at the end, or can it occur anywhere in the string? — Josh Crozier, Feb 06 '16 at 17:06
I gave up - http://pastebin.com/uKmuZnWZ is where I got to. Basically, the problem with DOMDocument is that the loadHTML method closes the tag at the edge of the document, not when it sees the next tag (which is what I was hoping for), meaning the translated DOM will remove most of itself. Sorry @user2057781 — IaMaCuP, Feb 06 '16 at 17:26
Regex won't work reliably, especially when self-closed tags or improperly balanced tags are present (if not, you have a chance). Otherwise, you'll have to parse the HTML, build an element tree and prune the branches that have `...rem` as their final element. — Tim Pietzcker, Feb 06 '16 at 17:39
I just noticed in your examples that you have mismatched tags - is that actually the case, or just a copy/paste error? — Tim Pietzcker, Feb 06 '16 at 17:43
I thought Pranav's suggestion was nuts, but the more I think about it the more it makes sense. — Pete, Feb 06 '16 at 19:10

score 1 · Accepted Answer · edited May 23 '17 at 12:23

I would strongly suggest using an HTML parser for this. However, since your question asks for a regular expression, you could use the following and replace the matches in a callback.

/(?s)<(\w+)[^>]*>(.*?)<\/\1>/

Explanation:

(?s) - The s flag, so that the . character also matches newline characters.
<(\w+)[^>]*> - Match an opening HTML tag and capture the element name
(.*?) - Second capturing group to match the contents of the HTML tag
<\/\1> - Match the closing HTML tag by using a back reference based on the first capturing group (which is the tag name).
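
To see what the two capturing groups hold, a quick check along these lines may help. The sample string is made up for illustration and is not from the question's real data:

```php
<?php
// Illustration only: inspect the capture groups of the pattern explained above.
$pattern = '/(?s)<(\w+)[^>]*>(.*?)<\/\1>/';
$sample  = '<a href="/">SomeText...rem</a>';   // invented sample input

preg_match($pattern, $sample, $m);

// $m[0] is the whole element, $m[1] the tag name, $m[2] the element's contents.
var_dump($m[0], $m[1], $m[2]);
```

Because the closing tag is matched through the back reference \1, an opening tag is only paired with a closing tag of the same name.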

Then use the function preg_replace_callback to replace the match with an empty string if the second capturing group contains the substring ...rem. Otherwise, leave the text unchanged by replacing the match with itself.


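// Drop the whole element when its contents contain the marker; otherwise keep the match.
$result = preg_replace_callback('/(?s)<(\w+)[^>]*>(.*?)<\/\1>/', function ($m) {
  return strpos($m[2], '...rem') !== false ? '' : $m[0];
}, $string);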
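
As for the parser route recommended at the top of this answer, a rough sketch with PHP's DOMDocument and DOMXPath might look like the following. The helper name removeMarkedElements and the wrapper id fragment-root are made up for this example. It assumes the element that directly encloses the marker text is the one to remove, that the input is a fragment libxml can parse (mismatched tags get repaired by the parser, so results may differ from the raw string), and that the text is plain ASCII.

```php
<?php
// Hypothetical helper: removes every element that directly contains the marker text.
function removeMarkedElements(string $html, string $marker = '...rem'): string
{
    $previous = libxml_use_internal_errors(true);   // silence warnings on sloppy markup
    $doc = new DOMDocument();
    // Wrap the fragment so it has a single known root (the id is an arbitrary choice).
    $doc->loadHTML('<div id="fragment-root">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $root  = $xpath->query('//div[@id="fragment-root"]')->item(0);

    // Snapshot the text nodes first, because the tree is modified while looping.
    foreach (iterator_to_array($xpath->query('.//text()', $root)) as $text) {
        if (strpos($text->nodeValue, $marker) === false) {
            continue;
        }
        $element = $text->parentNode;
        // Skip marker text that sits outside any tag, and elements already detached.
        if ($element !== null && !$element->isSameNode($root) && $element->parentNode !== null) {
            $element->parentNode->removeChild($element);
        }
    }

    // Serialize only the children of the wrapper, not the wrapper itself.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo removeMarkedElements('<b>keep</b><u>SomeText...rem</u><i>also keep</i>');
```

With nested input, the marker's immediate parent element is removed together with everything inside it, which is the behaviour asked for in the comments under the question.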

Thank you very much for your help. It's almost good, but one thing. When SomeText...rem Underline Underline it's fine. But when Underline Underline SomeText...rem the regex removes every . I need a regex that removes only the HTML tag with the ...rem and not every one. Hope this is possible? — labu77, Feb 07 '16 at 09:30
thank you very much. This works perfect! Smart and awesome work (IMHO) --> job done! btw - inconsistent closing tags fixed! — labu77, Feb 08 '16 at 11:37

score 0 · Answer 2 · 2016-02-09T08:21:06.327

Thought I'd take a shot.
Using PHP, here is an exact way to do it.

Updated version

This uses the \K construct, so there is no need to write the
Tracker data back to the string. Just replace with nothing.
It also gains speed this way.
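
As a small illustration of what \K buys here (a toy pattern, not the one from this answer): \K resets the start of the reported match, so anything consumed before it stays in the string without being captured and written back.

```php
<?php
// Without \K: the prefix is captured and has to be written back via $1.
$a = preg_replace('~(keep:)drop~', '$1', 'keep:drop');

// With \K: the match start is reset after "keep:", so only "drop" is replaced.
$b = preg_replace('~keep:\Kdrop~', '', 'keep:drop');

var_dump($a, $b);   // both variables hold the same string
```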

Formatted and tested:

 # ** Usage **
 # -----------------
 # Find: '~(?s)(?:(?:(?&Comment)?(?!(?&RawContent)|(?&Comment)).)*\K(?(?=\z)|(?<OpenTag>(?><(?:(?<TagName>[\w:]+)(?:".*?"|\'.*?\'|[^>]*?)+)>)(?<!/>))(?<Body>(?&Char_Not_Tag)*?(?:(?&Tag_Not_TargetOpen)(?&Char_Not_Tag)*?)*?(?=.)(?&RawContent)(?&Char_Not_Tag)*?(?:(?&Tag_Not_TargetOpen)(?&Char_Not_Tag)*?)*?)(?<CloseTag>(?><(?:/\2\s*)>)))|.*?(?:(?&RawContent)|(?&Comment))\K)(?(DEFINE)(?<RawContent>\.\.\.rem)(?<Tag_Not_TargetOpen>(?><(?:(?!\2)[\w:]+(?:".*?"|\'.*?\'|[^>]*?)+)>|(?&Comment)))(?<Char_Not_Tag>(?!(?><(?:[\w:]+(?:".*?"|\'.*?\'|[^>]*?)+)>)|(?&Comment)).)(?<Comment>(?><(?:!(?:(?:DOCTYPE.*?)|(?:\[CDATA\[.*?\]\])|(?:--.*?--)|(?:ATTLIST.*?)|(?:ENTITY.*?)|(?:ELEMENT.*?)))>)))~'
 # Replace: nothing

 # Dot-all modifier
 (?s)

 # Single group, two alternatives.

 (?:
      # Alternative 1 (highest priority)
      # =================================

      # This is the backtracker. This is crucial!
      # We go all the way up until we find
      # the raw content we are looking for,
      # or comments (because they could hide tags).
      # Then we backtrack from there to 
      # find the closest inner open/close tags
      # that contain our content.

      # Tracker1 - formerly a capture group that was written back in the replacement
      (?:
           (?&Comment)? 
           (?!
                (?&RawContent) 
             |  (?&Comment) 
           )
           . 
      )*

      # \K resets the match start, so Tracker1 does not need to be written back
      \K 

      # Conditional Assertion -
      # Have we reached the end of string without 
      # finding the tagged Content ?

      (?(?= \z )
           # ---------------------------------------------
           # Yes -  Don't do anything, the remainder is in
           # Tracker1 and is thrown away.
           # ---------------------------------------------

        |  
           # ---------------------------------------------
           # No - Find the tagged Content.
           # If no match, Tracker1 will backtrack 1 char and retry.
           # Here, Tracker1 will find up to the point
           # of the tagged Content and be consumed, but thrown away.
           # ---------------------------------------------

           # Get Target Open tag
           (?<OpenTag>                         # (1)
                (?>
                     <
                     (?:
                          (?<TagName> [\w:]+ )                # (2), tag name
                          (?: " .*? " | ' .*? ' | [^>]*? )+
                     )
                     >
                )
                (?<! /> )
           )

           # Get Body containing the raw content   
           (?<Body>                            # (3)

                # Stuff before raw content
                (?&Char_Not_Tag)*? 
                (?:
                     (?&Tag_Not_TargetOpen) 
                     (?&Char_Not_Tag)*? 
                )*?

                # The raw content we need
                (?= . )
                (?&RawContent)                       

                # Stuff after raw content
                (?&Char_Not_Tag)*? 
                (?:
                     (?&Tag_Not_TargetOpen) 
                     (?&Char_Not_Tag)*? 
                )*?
           )

           # Get Target Close tag
           (?<CloseTag>                        # (4)
                (?>
                     <
                     (?: / \2 \s* )
                     >
                )
           )
      )
   |  
      # Alternative 2 (lowest priority)
      # =================================

      # Here, we've already backtracked all
      # possibilities from Tracker1.
      # At this point, we have raw content, 
      # or comments that we must get past.
      # Comments because they could hide tags.
      # Just take it off, it will be thrown away.

      # Tracker2 - formerly a capture group that was written back in the replacement
      .*? 
      (?:
           (?&RawContent) 
        |  (?&Comment) 
      )

      # \K resets the match start, so Tracker2 does not need to be written back
      \K 
 )



 # Functions
 # -----------------------
 (?(DEFINE)

      (?<RawContent>                      # (5)

           # Raw content we are looking for.
           # Note - this is content and is not contained
           # in tags nor comments.

           \.\.\.rem                           # '...rem' or whatever
      )

      (?<Tag_Not_TargetOpen>              # (6)

           # Consume any tag that
           # is not the target Open tag.
           # Consume comments as well.
           (?>
                <
                (?:
                     (?! \2 )
                     [\w:]+ 
                     (?: " .*? " | ' .*? ' | [^>]*? )+
                )
                >
             |  
                (?&Comment) 
           )
      )

      (?<Char_Not_Tag>                    # (7)

           # Consume any character
           # that does not begin a tag or comment
           (?!
                (?>
                     <
                     (?:
                          [\w:]+ 
                          (?: " .*? " | ' .*? ' | [^>]*? )+
                     )
                     >
                )
             |  
                (?&Comment) 
           )
           .  
      )

      (?<Comment>                         # (8)

           # Comment
           (?>
                <
                (?:
                     !
                     (?:
                          (?: DOCTYPE .*? )
                       |  (?: \[CDATA\[ .*? \]\] )
                       |  (?: -- .*? -- )
                       |  (?: ATTLIST .*? )
                       |  (?: ENTITY .*? )
                       |  (?: ELEMENT .*? )
                     )
                )
                >
           )
      )
 )
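
The (?(DEFINE) ...) group used above only declares named subpatterns; they match nothing by themselves and are invoked elsewhere with (?&Name). A much smaller toy pattern, written only for illustration and assuming comments are plain <!-- ... --> blocks, shows the mechanism and why a comment is consumed as one unit:

```php
<?php
// Toy pattern, far simpler than the one above: named subpatterns via (?(DEFINE) ...).
$pattern = '~(?sx)
    (?: (?&Comment) | (?&Marker) )       # call the subpatterns by name
    (?(DEFINE)
        (?<Comment> <!-- .*? --> )       # a whole comment, consumed as one unit
        (?<Marker>  \.\.\.rem )          # the marker text
    )~';

// Invented sample: one marker is hidden inside a comment.
preg_match_all($pattern, 'a...rem <!-- hidden ...rem --> b...rem', $m);

// $m[0] lists the full matches in order of appearance.
print_r($m[0]);
```

The marker inside the comment never shows up as a match of its own, because the comment alternative swallows it first. That is the same reason the full pattern steps over comments before looking for tags.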

Test case

Input:

<div>blah blah <i>some text</i> ...rem</div>
<b>SomeText...rem</b>
<u>SomeText...rem</b>
<strong>SomeText...rem</b>
<a href="/">SomeText...rem</a>
<div>SomeText...rem</div>

Output:

 **  Grp 0                      -  ( pos 0 , len 44 ) 
<div>blah blah <i>some text</i> ...rem</div>  
 **  Grp 1 [OpenTag]            -  ( pos 0 , len 5 ) 
<div>  
 **  Grp 2 [TagName]            -  ( pos 1 , len 3 ) 
div  
 **  Grp 3 [Body]               -  ( pos 5 , len 33 ) 
blah blah <i>some text</i> ...rem  
 **  Grp 4 [CloseTag]           -  ( pos 38 , len 6 ) 
</div>  

---------------------

 **  Grp 0                      -  ( pos 46 , len 21 ) 
<b>SomeText...rem</b>  
 **  Grp 1 [OpenTag]            -  ( pos 46 , len 3 ) 
<b>  
 **  Grp 2 [TagName]            -  ( pos 47 , len 1 ) 
b  
 **  Grp 3 [Body]               -  ( pos 49 , len 14 ) 
SomeText...rem  
 **  Grp 4 [CloseTag]           -  ( pos 63 , len 4 ) 
</b>  

---------------------

 **  Grp 0                      -  ( pos 86 , len 0 )  EMPTY 
 **  Grp 1 [OpenTag]            -  NULL 
 **  Grp 2 [TagName]            -  ( pos 70 , len 1 ) 
u  
 **  Grp 3 [Body]               -  NULL 
 **  Grp 4 [CloseTag]           -  NULL 

---------------------

 **  Grp 0                      -  ( pos 114 , len 0 )  EMPTY 
 **  Grp 1 [OpenTag]            -  NULL 
 **  Grp 2 [TagName]            -  ( pos 93 , len 6 ) 
strong  
 **  Grp 3 [Body]               -  NULL 
 **  Grp 4 [CloseTag]           -  NULL 

---------------------

 **  Grp 0                      -  ( pos 120 , len 30 ) 
<a href="/">SomeText...rem</a>  
 **  Grp 1 [OpenTag]            -  ( pos 120 , len 12 ) 
<a href="/">  
 **  Grp 2 [TagName]            -  ( pos 121 , len 1 ) 
a  
 **  Grp 3 [Body]               -  ( pos 132 , len 14 ) 
SomeText...rem  
 **  Grp 4 [CloseTag]           -  ( pos 146 , len 4 ) 
</a>  

---------------------

 **  Grp 0                      -  ( pos 152 , len 25 ) 
<div>SomeText...rem</div>  
 **  Grp 1 [OpenTag]            -  ( pos 152 , len 5 ) 
<div>  
 **  Grp 2 [TagName]            -  ( pos 153 , len 3 ) 
div  
 **  Grp 3 [Body]               -  ( pos 157 , len 14 ) 
SomeText...rem  
 **  Grp 4 [CloseTag]           -  ( pos 171 , len 6 ) 
</div>

Previous version with Tracker write back.

 # ** Usage **
 # -----------------
 # Find: '~(?s)(?:(?<Tracker1>(?:(?&Comment)?(?!(?&RawContent)|(?&Comment)).)*)(?(?=\z)|(?<OpenTag>(?><(?:(?<TagName>[\w:]+)(?:".*?"|\'.*?\'|[^>]*?)+)>)(?<!/>))(?<Body>(?&Char_Not_Tag)*?(?:(?&Tag_Not_TargetOpen)(?&Char_Not_Tag)*?)*?(?=.)(?&RawContent)(?&Char_Not_Tag)*?(?:(?&Tag_Not_TargetOpen)(?&Char_Not_Tag)*?)*?)(?<CloseTag>(?><(?:/\3\s*)>)))|(?<Tracker2>.*?(?:(?&RawContent)|(?&Comment))))(?(DEFINE)(?<RawContent>\.\.\.rem)(?<Tag_Not_TargetOpen>(?><(?:(?!\3)[\w:]+(?:".*?"|\'.*?\'|[^>]*?)+)>|(?&Comment)))(?<Char_Not_Tag>(?!(?><(?:[\w:]+(?:".*?"|\'.*?\'|[^>]*?)+)>)|(?&Comment)).)(?<Comment>(?><(?:!(?:(?:DOCTYPE.*?)|(?:\[CDATA\[.*?\]\])|(?:--.*?--)|(?:ATTLIST.*?)|(?:ENTITY.*?)|(?:ELEMENT.*?)))>)))~'
 # Replace: '$1$6'
