I want to remove the nested tags that are left behind after various DOM manipulations, so that the final HTML string looks clean but still reflects the correct behaviour. I am using a regular expression to select the content inside the main tag and then replace the nested tags with "". The problem is that I cannot get the regular expression to select the content inside the main tag.

Example html string: <em>a<em>bbb<strong>ccc<em>ddddd</em></strong>eeee</em></em><strong>fff<em>gg</em>hh<strong>iii</strong>jjj</strong>

In this instance I am focused only on the strong tag, although there are nested em tags at the beginning. The reason is that the regular expression that matches the content inside the em tag does not match the desired content inside the strong tag.

Desired selection: fff<em>gg</em>hh<strong>iii</strong>jjj

The above selection is desired because a strong tag is present inside the main strong tag. The first strong tag, i.e. <strong>ccc<em>ddddd</em></strong>, is ignored because it is contained inside the em tag. I only want the content if the tag contains a nested tag of the same type.

I wrote a few regular expressions but the closest I could get was by using a regular expression: /(?<=<strong>)(?!\w*<\/strong>).*?<strong>.*?<\/strong>.*?(?=<\/strong>)/g.

But this only works if the closing strong tag has nothing but word characters before it. For example, it works on the following string: <em>a<em>bbb<strong>ccc</strong>eeee</em></em><strong>fff<em>gg</em>hh<strong>iii</strong>jjj</strong>.

But it does not work on the string: <em>a<em>bbb<strong>ccc<em>ddddd</em></strong>eeee</em></em><strong>fff<em>gg</em>hh<strong>iii</strong>jjj</strong>. The reason is evidently the presence of non-word characters before the closing strong tag. So I tried replacing \w* with .*? to match any character before the closing strong tag, but this did not work either.
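
A minimal sketch of what the two attempts described above do on the two sample strings. It assumes the second attempt simply swaps \w* for .*? inside the negative lookahead; the variable names are illustrative only.

```javascript
const desired = 'fff<em>gg</em>hh<strong>iii</strong>jjj';
const simple = '<em>a<em>bbb<strong>ccc</strong>eeee</em></em><strong>fff<em>gg</em>hh<strong>iii</strong>jjj</strong>';
const nested = '<em>a<em>bbb<strong>ccc<em>ddddd</em></strong>eeee</em></em><strong>fff<em>gg</em>hh<strong>iii</strong>jjj</strong>';

// First attempt: the lookahead only rejects a closing tag preceded by word characters.
const wordLookahead = /(?<=<strong>)(?!\w*<\/strong>).*?<strong>.*?<\/strong>.*?(?=<\/strong>)/g;
// Second attempt (assumed form): the lookahead now rejects ANY later closing tag.
const anyLookahead = /(?<=<strong>)(?!.*?<\/strong>).*?<strong>.*?<\/strong>.*?(?=<\/strong>)/g;

// One correct match: the content of the outer strong tag.
console.log(simple.match(wordLookahead));
// One overlong match that starts at "ccc" and runs across the first </strong>.
console.log(nested.match(wordLookahead));
// null in both cases: every position after <strong> has a </strong> somewhere ahead,
// so the negative lookahead can never succeed.
console.log(simple.match(anyLookahead));
console.log(nested.match(anyLookahead));
```

In other words, the first pattern lets .*? run straight through an unrelated closing tag, and the second one forbids a closing tag anywhere in the rest of the string rather than only in the stretch before the nested opening tag.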

Bikash K C
  • How many levels of nesting do you wish to handle? – Armali Feb 21 '23 at 08:43
  • And it just [never stops](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454)... – trincot Feb 21 '23 at 09:45
  • The number of levels can vary. What I want to achieve is to get all the elements and/or content inside the main tag. The number of nested tags can change, but that won't pose a problem, since the nested tags would be replaced with "" once the content inside the main tag is extracted as a string. – Bikash K C Feb 22 '23 at 23:08
  • How many levels of nesting do you wish to handle at most? – Armali Feb 23 '23 at 15:40
  • At most, handling 2 levels of nesting would suffice, but for my use case handling even one level, as in the example, would be OK: as soon as nesting is seen it would be taken care of, so no other levels of nesting would be present. For clarity, let's say only one level of nesting like in the example is the level of nesting I need to handle. – Bikash K C Feb 25 '23 at 00:51

1 Answer

… the closest I could get was by using a regular expression: /(?<=<strong>)(?!\w*<\/strong>).*?<strong>.*?<\/strong>.*?(?=<\/strong>)/g.

… only one level of nesting like in the example is the level of nesting I need to handle.

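The part of your expression that doesn't work right is (?!\w*<\/strong>).*?. We want to bar a closing strong tag anywhere in this stretch, i.e. between the opening main tag and the nested opening tag; this can be achieved by replacing that part with ((?!<\/strong>).)*, which consumes one character at a time only as long as no </strong> starts at that position.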

// Each input yields the single match fff<em>gg</em>hh<strong>iii</strong>jjj
for (const x of ['<em>a<em>bbb<strong>ccc<em>ddddd</em></strong>eeee</em></em><strong>fff<em>gg</em>hh<strong>iii</strong>jjj</strong>',
                 '<em>a<em>bbb<strong>ccc</strong>eeee</em></em><strong>fff<em>gg</em>hh<strong>iii</strong>jjj</strong>',
                 '<em>a<em>bbb<strong>ccc<em>ddddd</em></strong>eeee</em></em><strong>fff<em>gg</em>hh<strong>iii</strong>jjj</strong>'])
    console.log(x.match(/(?<=<strong>)((?!<\/strong>).)*<strong>.*?<\/strong>.*?(?=<\/strong>)/g))
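
Since the stated goal is to replace the nested tags with "", the same pattern can be handed to String.prototype.replace with a callback. The following is only a sketch under the one-level-of-nesting assumption from the comments; flattenNestedStrong is a hypothetical helper name, and it handles the strong tag only.

```javascript
// Hypothetical helper: removes strong tags nested one level inside another strong tag.
function flattenNestedStrong(html) {
  // Same pattern as above: lookbehind and lookahead keep the outer tags out of the
  // match, and ((?!<\/strong>).)* refuses to cross a closing strong tag.
  const inner = /(?<=<strong>)((?!<\/strong>).)*<strong>.*?<\/strong>.*?(?=<\/strong>)/g;
  // Within each matched stretch, drop every opening and closing strong tag.
  return html.replace(inner, (content) => content.replace(/<\/?strong>/g, ''));
}

const input = '<em>a<em>bbb<strong>ccc<em>ddddd</em></strong>eeee</em></em><strong>fff<em>gg</em>hh<strong>iii</strong>jjj</strong>';
console.log(flattenNestedStrong(input));
// <em>a<em>bbb<strong>ccc<em>ddddd</em></strong>eeee</em></em><strong>fff<em>gg</em>hhiiijjj</strong>
```

Strings without a nested strong tag pass through unchanged, because the pattern finds nothing to replace. The same construction does not carry over directly to the nested em tags in this example (the lazy tail would stop at the inner closing tag's sibling), so deeper or mixed nesting is probably better handled with a DOM parser.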
Armali