I have an XML file with a structure like the following:

<?xml version = '1.0' encoding="ISO-8859-1"?>
<!DOCTYPE stuff PUBLIC "stuff" "stuff.dtd">
<stuff>
  <level1>
    <type>foo</type>
    <name>name1_A</name>
    <junk1>garbage</junk1>
    <junk2>garbage</junk2>
    <level2>
      <name>name2_A</name>
      <junk3>garbage</junk3>
      <junk4>garbage</junk4>
      <level3>
        <name>name3_A</name>
        <junk5>garbage</junk5>
        <junk6>garbage</junk6>
      </level3>
      <level3>
        <name>name3_B</name>
        <junk5>garbage</junk5>
        <junk6>garbage</junk6>
      </level3>
    </level2>
    <level2>
      <name>name2_B</name>
      <junk>garbage</junk>
      <level3>
        <name>name3_A</name>
        <junk>garbage</junk>
      </level3>
      <level3>
        <name>name3_B</name>
        <junk>garbage</junk>
      </level3>
    </level2>
  </level1>
  <level1>
    <type>foo</type>
    <name>name1_B</name>
    <junk1>garbage</junk1>
    <junk2>garbage</junk2>
    <level2>
      <name>name2_A</name>
      <junk3>garbage</junk3>
      <junk4>garbage</junk4>
      <level3>
        <name>name3_A</name>
        <junk5>garbage</junk5>
        <junk6>garbage</junk6>
      </level3>
      <level3>
        <name>name3_B</name>
        <junk5>garbage</junk5>
        <junk6>garbage</junk6>
      </level3>
    </level2>
    <level2>
      <name>name2_B</name>
      <junk>garbage</junk>
      <level3>
        <name>name3_A</name>
        <junk>garbage</junk>
      </level3>
      <level3>
        <name>name3_B</name>
        <junk>garbage</junk>
      </level3>
    </level2>
  </level1>
</stuff>

I'd like to write an XSLT stylesheet that filters out all the elements named junk*. More precisely, I know the element names I want to keep and want to get rid of everything else. For the input above, the desired result has all the junk elements stripped out:

<?xml version = '1.0' encoding="ISO-8859-1"?>
<!DOCTYPE stuff PUBLIC "stuff" "stuff.dtd">
<stuff>
  <level1>
    <type>foo</type>
    <name>name1_A</name>
    <level2>
      <name>name2_A</name>
      <level3>
        <name>name3_A</name>
      </level3>
      <level3>
        <name>name3_B</name>
      </level3>
    </level2>
    <level2>
      <name>name2_B</name>
      <level3>
        <name>name3_A</name>
      </level3>
      <level3>
        <name>name3_B</name>
      </level3>
    </level2>
  </level1>
  <level1>
    <type>foo</type>
    <name>name1_B</name>
    <level2>
      <name>name2_A</name>
      <level3>
        <name>name3_A</name>
      </level3>
      <level3>
        <name>name3_B</name>
      </level3>
    </level2>
    <level2>
      <name>name2_B</name>
      <level3>
        <name>name3_A</name>
      </level3>
      <level3>
        <name>name3_B</name>
      </level3>
    </level2>
  </level1>
</stuff>

Keep in mind the various junk elements I have in my sample could be named anything - I have the list of element names I want to keep (e.g. level1/type, level1/name, level1/level2/name, level1/level2/level3/name, etc.) and want to drop everything else.

The best I've got so far is this XSLT, but here I have to explicitly list all the element names I want to remove, not the ones I want to keep, so it's less than ideal:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="no"/>

  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="junk1 | junk2 | junk3 | junk4 | junk5 | junk6"/>

</xsl:stylesheet>
  • Then you'll have to explicitly name all the elements you want to keep. Since the junk elements can occur at any level, writing a template to keep `` but discard any junk elements it contains will be much harder. What you have is already the optimal approach, why do you think you can improve on it? – Jim Garrison Nov 07 '17 at 05:53
  • If you think this is already optimal... I'm not going to argue! I've spent a while looking for a better approach to no avail. As you point out, my problem seems to be the fact that the junk elements can appear at any level. I've found a number of solutions that would deal with the case where they're all children at a particular level, but not scattered as I have them. – KrumpetMuncher Nov 07 '17 at 05:57

1 Answer

Instead of enumerating all the node names that you want to ignore one after another, you could group them into categories if their names share some common characteristics:

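  • all the tags whose name starts with a given string: //*[starts-with(name(), 'junk')]
  • all the tags whose name ends with a given string: //*[ends-with(name(), 'junk')] (note that ends-with() requires XPath 2.0, so it is not available in an XSLT 1.0 stylesheet)
  • all the tags whose name contains a given substring: //*[contains(name(), 'junk')]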
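As a minimal sketch of the first category, the stylesheet from the question could replace its explicit list with a single name-prefix test. The `xsl:strip-space` and `indent="yes"` settings are assumptions added here to tidy the whitespace left behind by removed elements; they are not part of the original stylesheet.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Assumption: discard whitespace-only text nodes and re-indent the output -->
  <xsl:strip-space elements="*"/>
  <xsl:output indent="yes"/>

  <!-- Identity template: copy every node and attribute by default -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Empty template: drop any element whose name starts with 'junk',
       at any depth; the predicate gives it priority over the identity template -->
  <xsl:template match="*[starts-with(name(), 'junk')]"/>
</xsl:stylesheet>
```

This only helps when the unwanted names really do share a prefix, which the question says is not guaranteed.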

If you don't know the exact names of the tags to be removed, you could invert the logic of your XSLT: apply the copy operation only to the nodes whose names you want to keep.
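One possible way to express that inverted logic, sketched here for the sample document in the question: drop every element by default and copy only the elements on a keep-list. The parent-qualified patterns below are taken from the paths the question mentions (level1/type, level1/name, and so on); the whitespace settings are an assumption, and the list would need extending for any other element to keep.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Assumption: discard whitespace-only text nodes and re-indent the output -->
  <xsl:strip-space elements="*"/>
  <xsl:output indent="yes"/>

  <!-- Default for elements: output nothing, so the element and its subtree are dropped -->
  <xsl:template match="*"/>

  <!-- Keep-list: each element is qualified by its parent, so a name is only
       kept where it is expected; these patterns outrank match="*" -->
  <xsl:template match="/stuff | stuff/level1 | level1/type | level1/name
                       | level1/level2 | level2/name
                       | level2/level3 | level3/name">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Attributes of kept elements are copied unchanged -->
  <xsl:template match="@*">
    <xsl:copy/>
  </xsl:template>
</xsl:stylesheet>
```

Text inside kept elements is emitted by the built-in text template, while text inside dropped elements is never visited. Because the default is to drop, an unexpected element at any level disappears without having to be named.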

If you know only the names of the tags you want to ignore, then use the following logic (here ServiceNode stands for the element name that is excluded from the match):

If by "node" you mean element, then use:

<xsl:template match="*[not(self::ServiceNode)]">

If by "node" you mean any node (of type element, text, comment, processing-instruction): use

<xsl:template match="node()[not(self::ServiceNode)]">

If you want only children of Document to be matched use:

<xsl:template match="Document/node()[not(self::ServiceNode)]">

If you want only children of the top element to be matched use:

<xsl:template match="/*/node()[not(self::ServiceNode)]">

How to write a xpath to match all elements except a particular element
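Applied to the question's sample, the `not(self::...)` pattern could be combined with the identity template roughly as follows. This is a sketch under the assumption that the keep-list is exactly stuff, level1, level2, level3, type and name; unlike a parent-qualified pattern, it tests names only, so a kept name is kept wherever it appears.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Assumption: discard whitespace-only text nodes and re-indent the output -->
  <xsl:strip-space elements="*"/>
  <xsl:output indent="yes"/>

  <!-- Identity template: copy every node and attribute by default -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Empty template: drop every element that is NOT on the keep-list -->
  <xsl:template match="*[not(self::stuff or self::level1 or self::level2
                             or self::level3 or self::type or self::name)]"/>
</xsl:stylesheet>
```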

Allan
  • I *do* know the names of the nodes I want to keep. I want to throw out everything *except* my known list of node names. My problem has been understanding how to do that when the node names to keep appear at different levels of the hierarchy. – KrumpetMuncher Nov 08 '17 at 02:54