I have several hundred XML files that I need to make a slight change to. I'm aware that I really should be using XSLT to make batch changes to XML structure, but I thought some quick and dirty regex would do what I need much faster than working out the XSLT. At least, I thought that before spending hours trying to get the regex right!

Take the example below. I have various lists (<seqlist>), each containing an <item> element for each item in the list. Each <item> element contains a <para> element with an id attribute whose value varies. I want to remove those <para> tags completely so that the <item> contains the actual text.

So from: <seqlist><item><para id="1.1">Some text here.</para></item></seqlist> To: <seqlist><item>Some text here.</item></seqlist>

This is fairly straightforward in itself; I can simply do:

Regex: <item><para id="([^\"]*)"> Replace: <item>

Then remove the redundant closing tags with a simple find and replace:

Find: </para></item> Replace: </item>.

However, as can be seen from the example below, some <item> elements in the list contain another <seqlist> nested within them, which contains further nested <item> and <para> tags. This means the above find and replace to remove the closing </para> tag will also replace the closing </para> in the very last line of the example below.

Basically, what I need to say is: find </para></item> and replace it with </item> UNLESS there is an opening <para> element to the left of it.

The very last line of the example below explains it better. If I do the above find and replace, the last </para> will be removed and the file will no longer parse.

Any ideas how to achieve this, please?

<seqlist>
  <item><para id="p7.1"><emphasis>JRK Type 1</emphasis>: (NSP XX-XX-XXX-XXXX)
outputs:
   <seqlist>
     <item><para id="p7.1.1">12 V or 15 V,0-5A</para></item>
     <item><para id="p7.1.2">12 V or 15 V,0-5A</para></item>
   </seqlist></para>
      <para>Both at 120 W maximum output power.</para><para>The outputs are isolated, permitting parallel or serial connection to provide power as required.</para></item>
    <item><para id="p7.2"><emphasis>JRK Type 2:</emphasis> (NSN 6130-99-788-6945) outputs:</para>
   <seqlist>
     <item><para id="p7.2.1">5 V, 0 - 30 A</para></item>
     <item><para id="p7.2.2">12 V, 0 - 0.5 A</para></item>
   </seqlist><para>Both at 120 W maximum output power.</para>
  <para>The 12 V outputs are measured with respect to a common 0 V line but these are isolated from the 5 V output.</para></item>
</seqlist>
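For reference, the failure described above can be reproduced in a few lines of Python. This is only a sketch: the shortened sample string and the helper names `naive_strip` and `is_well_formed` are made up for illustration, and the two replacements mirror the regex and the find and replace from the question.

```python
import re
import xml.etree.ElementTree as ET

# Shortened, hypothetical version of the nested example above.
SAMPLE = (
    '<seqlist>'
    '<item><para id="p7.1">Type 1 outputs:'
    '<seqlist><item><para id="p7.1.1">12 V</para></item></seqlist></para>'
    '<para>Both at 120 W.</para></item>'
    '</seqlist>'
)


def naive_strip(xml_text):
    """Apply the two text replacements from the question, in order."""
    # Step 1: drop the opening <para id="..."> that directly follows <item>.
    without_open = re.sub(r'<item><para id="[^"]*">', '<item>', xml_text)
    # Step 2: drop every closing </para> that directly precedes </item>.
    return without_open.replace('</para></item>', '</item>')


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(is_well_formed(SAMPLE))
    print(is_well_formed(naive_strip(SAMPLE)))
```

On the flat one-item list the two replacements are harmless, but on the nested sample the closing </para> after the inner </seqlist> is left orphaned while the one before the final </item> is removed, so the result no longer parses.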
Daedalus
  • You don't need xslt, just write a script that parses the xml (using an xml parser), makes the change and spits out what you want. – pvg Mar 20 '17 at 11:08
  • You could do a first pass that removes all non-nested occurrences - using a tool that reports the number of matches. Then repeat on the resultant set of files -- and do so iteratively until you get no more matches. – GavinBrelstaff Mar 20 '17 at 11:09
  • Are you asking how to do this with regex or with XSLT? This is trivial in XSLT. See also: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – michael.hor257k Mar 20 '17 at 11:11
  • If it can be done with XSLT I'll be happy to be shown the way. I've not used XSLT many times and it would take me hours to work out how to do this. Hence the regex route... which isn't a simple solution either, it seems. – Daedalus Mar 20 '17 at 11:52
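The parser-based approach suggested in the first comment might look like the following minimal Python sketch, using only the standard library's xml.etree.ElementTree. The helper names (`unwrap`, `strip_list_paras`, `strip_list_paras_in_text`) are hypothetical, and it assumes that only <para> elements carrying an id attribute directly inside seqlist/item should be unwrapped. ElementTree discards comments and any DOCTYPE when parsing with default settings, so it is worth checking the output on a copy of the files first.

```python
import xml.etree.ElementTree as ET


def _add_text_before(parent, index, text):
    """Append text at the point just before parent[index]."""
    if not text:
        return
    if index == 0:
        # No preceding sibling: the text belongs to the parent itself.
        parent.text = (parent.text or "") + text
    else:
        # Otherwise it follows the previous sibling, i.e. it is that sibling's tail.
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + text


def unwrap(parent, child):
    """Replace child with its own text and children, keeping document order."""
    index = list(parent).index(child)
    grandchildren = list(child)
    parent.remove(child)
    _add_text_before(parent, index, child.text)
    for offset, grandchild in enumerate(grandchildren):
        parent.insert(index + offset, grandchild)
    # Text that followed the removed element goes after its last moved child.
    _add_text_before(parent, index + len(grandchildren), child.tail)


def strip_list_paras(root):
    """Unwrap every seqlist/item/para that has an id attribute (assumption)."""
    # Collect the lists first, because the tree is modified while we walk it.
    for seqlist in list(root.iter("seqlist")):
        for item in seqlist.findall("item"):
            for para in item.findall("para"):
                if para.get("id") is not None:
                    unwrap(item, para)
    return root


def strip_list_paras_in_text(xml_text):
    """Convenience wrapper: parse a string, transform it, serialize it again."""
    root = strip_list_paras(ET.fromstring(xml_text))
    return ET.tostring(root, encoding="unicode")
```

For files rather than strings, the same function could be applied to the root returned by `ET.parse(path).getroot()`, with the tree then saved via `tree.write(...)`. Because the change is made on the parsed tree, nested lists need no special handling.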

1 Answer

Here is the trivial XSLT way:

<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="seqlist/item/para">
        <xsl:apply-templates/>
    </xsl:template>
</xsl:transform>

Online at http://xsltransform.net/3NSSEw6.

If only those para elements with an id attribute are to be removed then use

<xsl:template match="seqlist/item/para[@id]">
    <xsl:apply-templates/>
</xsl:template>

for that template instead; see http://xsltransform.net/3NSSEw6/1.

Martin Honnen