
XSL is hard. The answers to my question here got me mostly on the right track, but there are a few small things that I initially overlooked. Here is my latest attempt:

XSL:

<!--
    When a file is transformed using this stylesheet the output will be
    formatted as follows:

    1.)  Elements named "info" will be removed
    2.)  Attributes named "file_line_nr" or "file_name" will be removed
    3.)  Comments will be removed
    4.)  Processing instructions will be removed
    5.)  XML declaration will be removed
    6.)  Extra whitespace will be removed
    7.)  Empty attributes will be removed
    8.)  Elements with no attributes, child elements, or text will be removed
    9.)  All elements will be sorted by name recursively
    10.) All attributes will be sorted by name
-->
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output indent="yes" method="xml" omit-xml-declaration="yes"/>
    <xsl:strip-space elements="*"/>

    <!--
        Elements/attributes to remove.  Note that comments are not elements or
        attributes.  Since there is no template to match comments they are
        automatically ignored.
    -->
    <xsl:template match="@*[normalize-space()='']|info|@file_line_nr|@file_name"/>

    <!-- Match any attribute; xsl:copy on an attribute copies its name and value -->
    <xsl:template match="@*">
        <xsl:copy/>
    </xsl:template>

    <!-- Match any element -->
    <xsl:template match="*">
        <xsl:copy>
            <xsl:apply-templates select="@*">
                <xsl:sort select="name()"/>
            </xsl:apply-templates>
            <xsl:apply-templates>
                <xsl:sort select="name()"/>
            </xsl:apply-templates>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>

I think I have addressed every requirement except number 8. I can write a stylesheet that removes elements without children, or one that removes elements without attributes, but neither is what I want. I only want to remove elements that have no attributes, no child elements, and no text, judged after the other removals have been applied (see the `rewq` element in the input below).

Input XML:

<?xml version="1.0" encoding="UTF-8" standalone="no" ?><!-- XML declaration should be removed -->
<foo b="b" a="a" c="c">
    <?some-app inst="some instruction"?><!-- Processing instructions should be removed -->
    <qwer><!-- Keep elements like this because it has child elements -->
        <zxcv c="c" b="b"/><!-- Keep elements like this because it has attributes -->
        <id>some text</id><!-- Keep elements like this because it has text -->
        <info i="i"/><!-- Elements named "info" are to be removed -->
        <rewq file_line_nr="42" file_name="somefile.txt"/><!-- Attributes named "file_line_nr" and "file_name" are to be removed which will leave this element empty, so it should be removed too -->
        <vcxz c="c" b="b"/>
    </qwer>
    <baz e="e" d="d"/>
    <bar>
        <fdsa g="g" f="f"/>
        <asdf g="g" f="f"/>
    </bar>
</foo>

Desired Output XML: (No comments, no whitespace/indent, elements and attributes sorted)

<foo a="a" b="b" c="c">
<bar>
<asdf f="f" g="g"/>
<fdsa f="f" g="g"/>
</bar>
<baz d="d" e="e"/>
<qwer>
<id>some text</id>
<vcxz b="b" c="c"/>
<zxcv b="b" c="c"/>
</qwer>
</foo>
ubiquibacon

2 Answers


This should do the job:

<xsl:stylesheet 
  version="1.0" 
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
  xmlns:msxsl="urn:schemas-microsoft-com:xslt">
  <xsl:output indent="yes" method="xml" omit-xml-declaration="yes"/>
  <xsl:strip-space elements="*"/>

  <!--
        Elements/attributes to remove.  Note that comments are not elements or
        attributes.  Since there is no template to match comments they are
        automatically ignored.
    -->
  <xsl:template match="@*[normalize-space()='']|info|@file_line_nr|@file_name"/>

  <!-- Match any attribute; xsl:copy on an attribute copies its name and value -->
  <xsl:template match="@*">
    <xsl:copy/>
  </xsl:template>

  <!-- Match any element -->
  <xsl:template match="*">
    <xsl:variable name="elementFragment">
      <xsl:copy>
        <xsl:apply-templates select="@*">
          <xsl:sort select="name()"/>
        </xsl:apply-templates>
        <xsl:apply-templates>
          <xsl:sort select="name()"/>
        </xsl:apply-templates>
      </xsl:copy>
    </xsl:variable>
    <xsl:variable name="element" select="msxsl:node-set($elementFragment)/*"/>
    <xsl:if test="$element/@* or $element/* or normalize-space($element)">
      <xsl:copy-of select="$element"/>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>

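The idea is to process each element first, store the result in a variable, and then run the "has attributes, child elements, or text" test on that result instead of on the source element. That way the test sees the element after blank attributes, `info` children, and the other unwanted nodes have already been removed.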

The variable holds a result tree fragment, which in XSLT 1.0 must be converted to a node-set with an extension function before XPath can navigate into it. My stylesheet uses the Microsoft function msxsl:node-set; other processors have equivalent functions.
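
For processors that do not provide the Microsoft extension, the same approach can be sketched with the EXSLT function `exsl:node-set`, which libxslt, Xalan, and Saxon 6.5 implement. This is an untested adaptation of the stylesheet above, assuming your processor supports the `http://exslt.org/common` namespace; only the namespace declaration and the function prefix differ.

```xml
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:exsl="http://exslt.org/common"
  exclude-result-prefixes="exsl">
  <xsl:output indent="yes" method="xml" omit-xml-declaration="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- Same removal rules as in the stylesheet above -->
  <xsl:template match="@*[normalize-space()='']|info|@file_line_nr|@file_name"/>

  <!-- Copy every remaining attribute unchanged -->
  <xsl:template match="@*">
    <xsl:copy/>
  </xsl:template>

  <xsl:template match="*">
    <!-- Build the processed element as a result tree fragment -->
    <xsl:variable name="elementFragment">
      <xsl:copy>
        <xsl:apply-templates select="@*">
          <xsl:sort select="name()"/>
        </xsl:apply-templates>
        <xsl:apply-templates>
          <xsl:sort select="name()"/>
        </xsl:apply-templates>
      </xsl:copy>
    </xsl:variable>
    <!-- Convert the fragment to a node-set (assumes EXSLT support) -->
    <xsl:variable name="element" select="exsl:node-set($elementFragment)/*"/>
    <!-- Emit the element only if something survived inside it -->
    <xsl:if test="$element/@* or $element/* or normalize-space($element)">
      <xsl:copy-of select="$element"/>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>
```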

MiMo
  • Should I put the XPath you suggested in its own template with a lower priority than my current empty template so that blank attributes and specific elements I don't want (like `info`) will be removed first, before it is decided that an element is indeed "empty"? – ubiquibacon Sep 11 '13 at 15:45
  • Templates are applied only once, so the priority trick would not work. – MiMo Sep 11 '13 at 16:06
  • Awesome. I think this is doing what I want, but it has the unfortunate side effect of taking much longer than what I had (25 seconds to transform a 15MB file compared to 5 seconds with what I did have). I'm going to keep testing this to make sure I haven't missed any corner cases. I implemented `node-set` as described in the answer [here](http://stackoverflow.com/a/329989/288341). – ubiquibacon Sep 11 '13 at 18:03

The simplest way is to have a rule which suppresses processing for all elements:

<xsl:template match="*"/>

Then follow it with a rule that matches elements with at least one attribute or child element:

<xsl:template match="*[attribute::*] | *[child::*]">
    ...process...
</xsl:template>

Or, if you prefer the abbreviated syntax:

match="*[@*] | *[*]"
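
The two patterns above test the source element, so on their own they do not account for text content, or for attributes and children that other templates remove (such as `rewq` in the question). One possible extension, offered as a sketch and not benchmarked against the node-set approach, inverts the logic: a higher-priority empty template drops any element for which neither it nor any descendant outside an `info` subtree has a surviving attribute or non-blank text. The predicate simply restates the question's removal rules, so it would need updating if those rules change.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output indent="yes" method="xml" omit-xml-declaration="yes"/>
    <xsl:strip-space elements="*"/>

    <!-- Blank attributes, "info" elements, and the two file_* attributes are dropped -->
    <xsl:template match="@*[normalize-space()='']|info|@file_line_nr|@file_name"/>

    <!--
        Drop an element when neither it nor any descendant outside an "info"
        subtree carries a surviving attribute or non-blank text.  A pattern
        with a predicate has default priority 0.5, so it wins over match="*".
    -->
    <xsl:template match="*[not(descendant-or-self::*[not(ancestor-or-self::info)][@*[normalize-space() != '' and name() != 'file_line_nr' and name() != 'file_name'] or text()[normalize-space() != '']])]"/>

    <!-- Copy every remaining attribute unchanged -->
    <xsl:template match="@*">
        <xsl:copy/>
    </xsl:template>

    <!-- Copy every remaining element, sorting attributes and children by name -->
    <xsl:template match="*">
        <xsl:copy>
            <xsl:apply-templates select="@*">
                <xsl:sort select="name()"/>
            </xsl:apply-templates>
            <xsl:apply-templates>
                <xsl:sort select="name()"/>
            </xsl:apply-templates>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>
```

Because everything happens in a single pass with no extension function, this avoids building a temporary tree per element. The descendant scan has its own cost on deeply nested input, so it is worth timing both versions on a real file.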