How to match multiple nested opening and closing tags using regex?

Question

I am working with regular expressions in Node.js. I want to remove the part of the string marked with *** below:

***<bPoint id="1" >
     <bLabel>
       <text></text>
     </bLabel>
     <content src="p112" />***
     <bPoint id="2">
         <bLabel>
            <text>xxx</text>
         </bLabel>
          <content src="p1123" />
     </bPoint>
***</bPoint>***
<bPoint id="bPoint-2" >
    <bLabel>
          <text>xxx</text>
    </bLabel>
    <content src="p1123" />        
</bPoint>

That is, I want to remove

  <bPoint>... may also contain inner bPoint tags ...</bPoint>

Can anyone help me remove this string from the XML above using regex?

Wouldn't it be better to use a parser, especially as it's not clear at all what you want to remove here ? — adeneo, Jun 05 '15 at 15:44
Not using regex would probably be the easiest way. What you have seems like XML, so parse it into a DOM, mutate it and serialize it back to XML. Obligatory link: http://stackoverflow.com/q/1732348/218196 — Felix Kling, Jun 05 '15 at 15:44
If you could parse CDATA and comments at the same time, and used a regex engine that supported recursion (PCRE,Perl,etc..) this might be doable, but probably a little complex for some people. — , Jun 05 '15 at 16:26
@sln: when recursion is not available, you can always remove the innermost elements in several passes. — Casimir et Hippolyte, Jun 05 '15 at 17:06
@CasimiretHippolyte - That might be his only hope. You should post that. — , Jun 05 '15 at 17:09

score 0 · Answer 1 · answered Jun 13 '15 at 14:26

The following Perl regular expression search string, used with an empty replace string, can delete all outermost bPoint elements from the file in a text editor like UltraEdit. This works only if the XML file is well formatted, with the elements on separate lines and correct indentation as in the example.

^([\t ]*)<bPoint.*?>[\s\S]+?\n\1</bPoint>[\t ]*(?:\r?\n|$)

UltraEdit has the command XML Convert to CR/LFs in the Format menu to produce a well-formatted XML file.

Expression explanation:

^ ... start search at beginning of a line.

([\t ]*) ... find 0 or more tabs or spaces at beginning of a line and mark them for back referencing.

<bPoint.*?> ... find start tag of element bPoint.

[\s\S]+? ... find any character, including line terminators, 1 or more times, non-greedy.

\n\1</bPoint> ... find a line-feed, exactly the same tabs and/or spaces as at the beginning of the found string, and the end tag of element bPoint. Because the end tag must be preceded by exactly the same indentation as the start tag, the inner bPoint elements are ignored by this search string.

[\t ]*(?:\r?\n|$) ... find 0 or more tabs or spaces, followed by an optional carriage return and a line-feed, OR the end of the file in case the bPoint element ends on the last line of the file with no line terminator.
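
As a rough illustration of the explanation above, the same expression can be tried in JavaScript with the m flag, so that ^ and $ match at line boundaries. This is only a sketch: the names sampleXml, outerBPoint, found and matchedLines and the sample data are assumptions made for the example and are not part of the original answer.

```javascript
// Hypothetical sample input: an outer bPoint with a nested, indented one.
const sampleXml = [
  '<bPoint id="1">',
  '    <content src="p112" />',
  '    <bPoint id="2">',
  '        <content src="p1123" />',
  '    </bPoint>',
  '</bPoint>',
  '<tag>kept</tag>',
  ''
].join('\n');

// Same expression as explained above; the m flag makes ^ and $
// match at the start and end of each line.
const outerBPoint = /^([\t ]*)<bPoint.*?>[\s\S]+?\n\1<\/bPoint>[\t ]*(?:\r?\n|$)/m;

const found = sampleXml.match(outerBPoint);
const matchedLines = found[0].trimEnd().split('\n');

console.log(JSON.stringify(found[1]));              // captured indentation
console.log(matchedLines[0]);                       // first matched line
console.log(matchedLines[matchedLines.length - 1]); // last matched line
console.log(matchedLines.length);                   // number of matched lines
```

The captured indentation of the outer start tag is empty here, so only an end tag at the very start of a line can close the match. The indented inner end tag is skipped and the match runs through to the outer end tag.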

JavaScript:

In JavaScript, with the well-formatted XML block held in a string variable, the code to remove all outermost bPoint elements would be:

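// String variable sXmlBlock contains the well formatted XML block.
var nXmlBlockLength;
do
{
    // Remember the length to detect whether this pass removed anything.
    nXmlBlockLength = sXmlBlock.length;
    // (^|\n) replaces ^ because no m flag is used; "$1" puts that line break back.
    sXmlBlock = sXmlBlock.replace(/(^|\n)([\t ]*)<bPoint.*?>[\s\S]+?\n\2<\/bPoint>[\t ]*(?:\r?\n|$)/g,"$1");
}
while ((sXmlBlock.length < nXmlBlockLength) && (sXmlBlock.length > 0));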

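The loop is necessary when multiple bPoint elements appear in series, as in the example, and all of them should be removed from the XML block. A match consumes the line break after its end tag, so a bPoint element starting on the very next line is not matched in the same pass.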

For this input XML block:

<tag>value 1</tag>
<bPoint id="1" >
     <bLabel>
       <text></text>
     </bLabel>
     <content src="p112" />
     <bPoint id="2">
         <bLabel>
            <text>xxx</text>
         </bLabel>
          <content src="p1123" />
     </bPoint>
</bPoint>
<bPoint id="bPoint-2" >
    <bLabel>
          <text>xxx</text>
    </bLabel>
    <content src="p1124" />
</bPoint>
<tag>value 2</tag>
<bPoint id="bPoint-3" >
    <bLabel>
          <text>xxx</text>
    </bLabel>
    <content src="p1125" />
</bPoint>

the script produces this output:

<tag>value 1</tag>
<tag>value 2</tag>
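
To see why the replacement has to be repeated, a single pass can be compared with the looped version. The following is a minimal sketch: the helper removeAllBPoints, the variable names and the sample data are hypothetical, and the loop stops when the string no longer changes instead of comparing lengths as the code above does.

```javascript
// Expression from the JavaScript code above, unchanged.
const bPointPattern = /(^|\n)([\t ]*)<bPoint.*?>[\s\S]+?\n\2<\/bPoint>[\t ]*(?:\r?\n|$)/g;

// Hypothetical helper: repeat the replacement until nothing changes.
function removeAllBPoints(xml) {
  let previous;
  do {
    previous = xml;
    xml = xml.replace(bPointPattern, '$1');
  } while (xml !== previous);
  return xml;
}

// Two bPoint elements directly in series (hypothetical sample data).
const adjacentXml = [
  '<tag>value 1</tag>',
  '<bPoint id="a">',
  '    <content src="p1" />',
  '</bPoint>',
  '<bPoint id="b">',
  '    <content src="p2" />',
  '</bPoint>',
  '<tag>value 2</tag>',
  ''
].join('\n');

// One pass removes the first element only, because its match has already
// consumed the line break that the second element would need in front of it.
const singlePass = adjacentXml.replace(bPointPattern, '$1');
console.log(singlePass.includes('<bPoint id="b">')); // true

// Repeating the replacement removes both elements.
console.log(removeAllBPoints(adjacentXml));
```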

It is of course possible to modify the search expression to remove just a specific bPoint element based on some criterion. But the question does not make clear what should be removed or what the criterion for removal is. An example showing the input to the script and the expected output, with an explanation of the criteria, would have helped a lot in understanding the requirements for the replacement task.
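
As one possible variation, and only as a sketch, the expression could be narrowed to a single element. The criterion used here, the value of the id attribute, is an assumption, since the question does not state one. The helper removeBPointById and the sample data are hypothetical, and the same well-formatted, consistently indented XML is required.

```javascript
// Hypothetical helper: remove only the bPoint whose id attribute equals `id`.
// Relies on the start and end tags having identical indentation.
function removeBPointById(xml, id) {
  // Escape regex metacharacters so the id is matched literally.
  const escapedId = id.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const pattern = new RegExp(
    '(^|\\n)([\\t ]*)<bPoint\\b[^>]*\\sid="' + escapedId + '"[^>]*>' +
    '[\\s\\S]+?\\n\\2</bPoint>[\\t ]*(?:\\r?\\n|$)'
  );
  return xml.replace(pattern, '$1');
}

// Hypothetical sample data: an outer bPoint with one nested bPoint.
const nestedXml = [
  '<bPoint id="1">',
  '    <content src="p112" />',
  '    <bPoint id="2">',
  '        <content src="p1123" />',
  '    </bPoint>',
  '</bPoint>',
  ''
].join('\n');

// Removes only the inner element and keeps the outer one.
console.log(removeBPointById(nestedXml, '2'));
```

Passing '1' instead removes the outer element together with everything nested inside it.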

Thanks for the reply. I have tried this using regex, but in vain, because the end tag may be at a different position (dynamic and nested). We cannot handle that with a regex. I have tried reading the exactly matching open and close bPoint tags with the XML DOM. Note: I am using node.js. — Vanarajan, Jun 15 '15 at 06:24
