
I have a number of XML files that contain a lot of overhead. I want to keep only about 20 specific elements and filter out everything else. I know the names of all the elements I want to keep, and I know which of them are child elements and what their parents are. After the transformation, the kept elements must retain their original position in the hierarchy.

E.g. I want to keep ONLY

<ns:currency>

in:

<ns:stuff>
 <ns:things>
  <ns:currency>somecurrency</ns:currency>
  <ns:currency_code/>
  <ns:currency_code2/>
  <ns:currency_code3/>
  <ns:currency_code4/>
 </ns:things>
</ns:stuff>

and make it look like this:

<ns:stuff>
 <ns:things>
  <ns:currency>somecurrency</ns:currency>
 </ns:things>
</ns:stuff>

What would be the best way of constructing an XSLT to accomplish this?

cc0
  • Possible duplicate of [How to remove elements from xml using xslt with stylesheet and xsltproc?](http://stackoverflow.com/questions/321860/how-to-remove-elements-from-xml-using-xslt-with-stylesheet-and-xsltproc) – MarcoS Apr 26 '11 at 12:09
  • In that example you specify which elements to leave out, I need to specify the elements to leave in and filter out anything else. – cc0 Apr 26 '11 at 12:13
  • I agree with MarcoS. It is a duplicate. The accepted answer is pretty much what you need – Lukas Eder Apr 26 '11 at 12:16
  • For the specific example I gave, you could use that solution, I agree. But for the general question it does not work. As I explained, I need the inverse: I need to specify what I want to keep, not what I want to remove. There are far more elements to remove than to keep, so it would not make sense to list all the ones I want to remove. – cc0 Apr 26 '11 at 12:22
  • Maybe you can try reverting the body of both templates... – Robert Bossy Apr 26 '11 at 12:30
  • I can't see how that would work. Maybe I don't understand you correctly, do you have an example? – cc0 Apr 26 '11 at 12:32
  • Good question, +1. See my answer for a general solution that can be used to preserve any element whose name is in a "white-list" and also preserve the structural relationships of these elements in the document. You can always use this transformation for any such kind of task. – Dimitre Novatchev Apr 26 '11 at 13:08
  • Also added extensive explanation. :) – Dimitre Novatchev Apr 26 '11 at 13:18

2 Answers


This general transformation:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:ns="some:ns">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <ns:WhiteList>
  <name>ns:currency</name>
  <name>ns:currency_code3</name>
 </ns:WhiteList>

 <xsl:template match="node()|@*">
     <xsl:copy>
       <xsl:apply-templates select="node()|@*"/>
     </xsl:copy>
 </xsl:template>

 <xsl:template match=
  "*[not(descendant-or-self::*[name()=document('')/*/ns:WhiteList/*])]"/>
</xsl:stylesheet>

when applied on the provided XML document (with namespace definition added to make it well-formed):

<ns:stuff xmlns:ns="some:ns">
    <ns:things>
        <ns:currency>somecurrency</ns:currency>
        <ns:currency_code/>
        <ns:currency_code2/>
        <ns:currency_code3/>
        <ns:currency_code4/>
    </ns:things>
</ns:stuff>

produces the wanted result (white-listed elements and their structural relations are preserved):

<ns:stuff xmlns:ns="some:ns">
   <ns:things>
      <ns:currency>somecurrency</ns:currency>
      <ns:currency_code3/>
   </ns:things>
</ns:stuff>

Explanation:

  1. The identity rule/template copies all nodes "as-is".

  2. The stylesheet contains a top-level <ns:WhiteList> element whose <name> children specify the names of all white-listed elements, that is, the elements that are to be preserved together with their structural relationships in the document.

  3. The <ns:WhiteList> element is best kept in a separate document so that the stylesheet does not need to be edited when names are added. Here the whitelist is in the same stylesheet just for convenience.

  4. A single template overrides the identity template. It does not process (and so deletes) any element that is not white-listed and has no white-listed descendant.
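
Point 3 could be sketched roughly as follows, assuming the white-list is moved to a file named `whitelist.xml` that sits next to the stylesheet. The file name and its root element `WhiteList` are hypothetical. In XSLT 1.0 a match pattern may not contain a variable reference, so the `document()` call is written inline in the predicate, just as `document('')` is in the transformation above.

```xml
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <!-- Assumed content of whitelist.xml (hypothetical file):
        <WhiteList>
          <name>ns:currency</name>
          <name>ns:currency_code3</name>
        </WhiteList>
 -->

 <!-- Identity rule: copies every node and attribute as-is -->
 <xsl:template match="node()|@*">
     <xsl:copy>
       <xsl:apply-templates select="node()|@*"/>
     </xsl:copy>
 </xsl:template>

 <!-- Drop any element that is not white-listed and has no
      white-listed descendant. A relative URI passed to document()
      as a string is resolved against the stylesheet's location. -->
 <xsl:template match=
  "*[not(descendant-or-self::*
          [name()=document('whitelist.xml')/*/name])]"/>
</xsl:stylesheet>
```

As in the original, the comparison uses `name()`, so the prefixes listed in the white-list have to match the prefixes actually used in the source document.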

Dimitre Novatchev
  • What does 'node()|@*' match exactly? – snoofkin Apr 26 '11 at 13:50
  • @soulSurfer2010: every `node()` (element, text, comment, processing instruction) and every attribute. – Dimitre Novatchev Apr 26 '11 at 14:33
  • +1 Better semantic. Although one should also test for namespace URI, and for this question a more simple `*[not(descendant-or-self::ns:currency)]` would be enough. –  Apr 26 '11 at 16:42
  • @Alejandro: Yes, but I am providing a more general solution that solves a whole class of such problems. – Dimitre Novatchev Apr 26 '11 at 17:29
  • I bumped into an interesting issue: I have an element named "number" that occurs as a child of multiple parents, such as "student" and "teacher". What is a good approach if I only want to keep the "number" elements that are children of "student", and not those of "teacher"? Perhaps there is a way to specify which hierarchies the "number" element can legally belong to, but how? – cc0 Apr 27 '11 at 06:24
  • @Dimitre Novatchev: can I replace 'node()|@*' with '* | @*'?? – snoofkin Apr 27 '11 at 12:27
  • @soulSurfer2010: Generally, no -- this is the most generic template and it should remain generic. However, if you insist on breaking its beauty, and provided there aren't any text, comment and PI nodes (or you don't want them copied), then yes, you could use the match pattern you propose. Why don't you just try and see if this is what you want? Dare to experiment yourself. – Dimitre Novatchev Apr 27 '11 at 12:49
  • @cc0: In such more complicated cases the white-listing table loses its advantages. The best and most flexible way forward in such cases is having templates like `<xsl:template match="teacher/number"/>`. This template ignores any `teacher/number` elements (effectively "deleting" them). All such templates should be put (for best flexibility and convenience) in a separate `<xsl:stylesheet>` and it should be imported (`<xsl:import>`) from the primary stylesheet of the XSLT application. – Dimitre Novatchev Apr 27 '11 at 12:55
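
For the context-sensitive case raised in the comments (keep `number` under `student` but not under `teacher`), a minimal sketch might pair the identity rule with one empty template. The element names `student`, `teacher` and `number` come from the comment; that they are in no namespace is an assumption.

```xml
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <!-- Identity rule: copies every node and attribute as-is -->
 <xsl:template match="node()|@*">
     <xsl:copy>
       <xsl:apply-templates select="node()|@*"/>
     </xsl:copy>
 </xsl:template>

 <!-- Empty template: a number element whose parent is teacher
      produces no output, so it is effectively deleted.
      student/number still falls through to the identity rule. -->
 <xsl:template match="teacher/number"/>
</xsl:stylesheet>
```

If such empty templates are moved into an imported stylesheet, note that templates in the importing stylesheet have higher import precedence than imported ones, so an identity rule left in the primary stylesheet would override them. One way around this is to keep the identity rule in the imported module as well.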

In XSLT you usually don't remove the elements you want to drop, but you copy the elements you want to keep:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet 
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:ns="http://www.example.com/ns#"
    version="1.0">

    <xsl:output method="xml" indent="yes" omit-xml-declaration="no"/>

     <xsl:template match="/ns:stuff">
        <xsl:copy>
            <xsl:apply-templates select="ns:things"/>
        </xsl:copy>
     </xsl:template>

     <xsl:template match="ns:things">
        <xsl:copy>
            <xsl:apply-templates select="ns:currency"/>
            <xsl:apply-templates select="ns:currency_code3"/>                   
        </xsl:copy>
     </xsl:template>

     <xsl:template match="ns:currency">
        <xsl:copy-of select="."/>
     </xsl:template>

     <xsl:template match="ns:currency_code3">
        <xsl:copy-of select="."/>
     </xsl:template>

</xsl:stylesheet>

The example above copies only currency and currency_code3. The output is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<ns:stuff xmlns:ns="http://www.example.com/ns#">
   <ns:things>
      <ns:currency>somecurrency</ns:currency>
      <ns:currency_code3/>
   </ns:things>
</ns:stuff>

Note: I added a namespace declaration for your prefix ns.

If you want to copy everything except a few elements, you may see this answer

MarcoS