Being relatively new to XSLT, I have what I hope is a simple question. I have some flat XML files, which can be pretty big (e.g. 7MB), that I need to make 'more hierarchical'. For example, the flat XML might look like this:

<D0011>
    <b/>
    <c/>
    <d/>
    <e/>
    <b/>
    ....
    ....
</D0011>

and it should end up looking like this:

<D0011>
  <b>
    <c/>
    <d/>
    <e/>
  </b>
  <b>
 ....
 ....
</D0011>

I have a working XSLT for this. It essentially gets a node-set of all the b elements and then uses the 'following-sibling' axis to get the nodes following the current b node (i.e. following-sibling::*[position()=$nodePos]). Recursion then adds the siblings to the result tree until another b element is found (I have parameterised it, of course, to make it more generic).

I also have a solution that just passes along the position in the XML of the next b node and selects the nodes after it one at a time (using recursion) via a *[position() = $nodePos] selection.

The problem is that the time to execute the transformation increases unacceptably with the size of the XML file. Looking into it with XML Spy, it seems that the 'following-sibling' and 'position()=' steps are what take the time in the two respective methods.

What I really need is a way of restricting the number of nodes in the above selections so that fewer comparisons are performed: every time the position is tested, every node in the node-set is tested to see if its position is the right one. Is there a way to do that? Any other suggestions?

Thanks,

Mike

Nic

3 Answers

Yes, there is a way to do it much more efficiently: see Muenchian grouping. If, having looked at this, you need more help with the details, let us know. The key you'll need is something like:

<xsl:key name="elements-by-group" match="*[not(self::b)]"
   use="generate-id(preceding-sibling::b[1])" />

Then you can iterate over the <b> elements, and for each one, use key('elements-by-group', generate-id()) to get the elements that immediately follow that <b>.

The task of "making the XML more hierarchical" is sometimes called up-conversion, and your scenario is a classic case for it. As you may know, XSLT 2.0 has very useful grouping features that are easier to use than the Muenchian method.

In your case it sounds like you would use <xsl:for-each-group select="*" group-starting-with="b"> or, in a form where the element name can be replaced by a parameter, <xsl:for-each-group select="*" group-starting-with="*[local-name() = 'b']">. But maybe you already considered that and can't use XSLT 2.0 in your environment.
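
For anyone who can use XSLT 2.0, here is a minimal sketch of that grouping approach. It is untested against the real data. The parameter name `sep` is an assumption for illustration, and the sketch assumes the first child of D0011 is a group-starting element (otherwise the leading elements would form a group of their own, wrapped in a copy of the first of them).

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumed parameter: local name of the element that starts each group -->
  <xsl:param name="sep" select="'b'"/>

  <xsl:template match="D0011">
    <xsl:copy>
      <!-- A new group starts at every child whose local name equals $sep -->
      <xsl:for-each-group select="*"
          group-starting-with="*[local-name() = $sep]">
        <!-- The context item is the first element of the group; copy it shallowly -->
        <xsl:copy>
          <!-- current-group() includes the starting element itself, so skip it -->
          <xsl:copy-of select="current-group()[position() gt 1]"/>
        </xsl:copy>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Unlike XSLT 1.0 match patterns, XSLT 2.0 patterns may reference global parameters, which is why `$sep` can appear in `group-starting-with` here.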

Update:

In response to the request for parameterization, here's a way to do it without a key. Note though that it may be much slower, depending on your XSLT processor.

<xsl:template match="D0011">
   <xsl:for-each select="*[local-name() = $sep]">
      <xsl:copy>
         <xsl:copy-of select="following-sibling::*[not(local-name() = $sep)
               and generate-id(preceding-sibling::*[local-name() = $sep][1]) =
                    generate-id(current())]" />
      </xsl:copy>
   </xsl:for-each>      
</xsl:template>

As noted in the comment, you can keep the performance benefit of keys by defining several different keys, one for each possible value of the parameter. You then select which key to use by using an <xsl:choose>.
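
A rough sketch of that multiple-key workaround for XSLT 1.0 might look like the following. The key names `by-b` and `by-x`, the second element name `x`, and the parameter name `sep` are all invented for illustration; you would add one key and one `xsl:when` branch per element name you need to support.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumed parameter: local name of the group-starting element -->
  <xsl:param name="sep" select="'b'"/>

  <!-- One key per supported group-starting element ('x' is hypothetical) -->
  <xsl:key name="by-b" match="*[not(self::b)]"
      use="generate-id(preceding-sibling::b[1])"/>
  <xsl:key name="by-x" match="*[not(self::x)]"
      use="generate-id(preceding-sibling::x[1])"/>

  <xsl:template match="D0011">
    <xsl:copy>
      <xsl:for-each select="*[local-name() = $sep]">
        <xsl:copy>
          <!-- Pick the key that corresponds to the current value of $sep -->
          <xsl:choose>
            <xsl:when test="$sep = 'b'">
              <xsl:copy-of select="key('by-b', generate-id())"/>
            </xsl:when>
            <xsl:when test="$sep = 'x'">
              <xsl:copy-of select="key('by-x', generate-id())"/>
            </xsl:when>
          </xsl:choose>
        </xsl:copy>
      </xsl:for-each>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Whether the unused keys cost anything depends on the processor; many build a key index only when `key()` is first called for it.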

Update 2:

To define the group-starting element as whatever element /*/*[2] is, rather than via a parameter, use:

<xsl:key name="elements-by-group"
   match="*[not(local-name(.) = local-name(/*/*[2]))]"
   use="generate-id(preceding-sibling::*
                           [local-name(.) = local-name(/*/*[2])][1])" />

<xsl:template match="D0011">
   <xsl:for-each select="*[local-name(.) = local-name(../*[2])]">
      <xsl:copy>
         <xsl:copy-of select="key('elements-by-group', generate-id())"/>
      </xsl:copy>
   </xsl:for-each>
</xsl:template>
LarsH
  • Hi Lars, Thanks very much for your suggestion. I have read through the description of Muenchian grouping, and it is very interesting. However, I can't quite see how to apply it in my situation ! (I probably need to read through chapter 6 of my XSLT book again...). There is the phrase in there 'It can be applied in any situation where you are grouping nodes according to a property of the node that is retrievable through an XPath.' - my nodes are all at the same level, so don't have anything you could apply a key to ? You're right by the way - I have to use XSLT 1.0. Cheers, Nic – Nic Jan 10 '11 at 16:51
  • Thanks for the edit ! I will give this a try and come back with more stupid questions I expect.... – Nic Jan 10 '11 at 16:56
  • @LarsH: I don't think the Muenchian method applies here, because we are not going to select the "first of a kind"; that's also why the special attributes `xsl:for-each-group/@group-starting-with` and `xsl:for-each-group/@group-ending-with` exist. –  Jan 10 '11 at 17:04
  • If you really have to solve this with XSLT 1.0 then you have my sympathy (2.0 makes it so much easier), but you have two approaches, both suggested as answers on this forum: (a) "sibling recursion" (recursing through the following-sibling axis), and (b) Muenchian grouping using "generate-id(preceding-sibling::b[1])" as the grouping key. – Michael Kay Jan 10 '11 at 22:52
  • Martin and Lars - excellent improvement in performance: at least 10 times faster ! But I read last night that the match and use attributes can't be variables (this is in the XSLT 1.0 spec apparently) - disaster ! To avoid having ten almost identical XSLT's, plus a load of logic to choose the right one, I need to be able to specify a parameter to replace 'b'. Any ideas ? Cheers. – Nic Jan 11 '11 at 09:36
  • @Alej - thanks for your comment; I'm afraid I didn't really understand it though... could you elaborate? – LarsH Jan 11 '11 at 21:48
  • @Mike: it's true, you can't use variables in the match/use attributes of a key. A couple of workarounds: 1) instead of 10 stylesheets, you could have 10 keys, and in the place where you use `key()`, have a big choose/when/test/when/test/... that selects which key to use based on your parameter. 2) instead of using a key, use the following inside your for-each select="/D0011/*[local-name() = $param1]": `following-sibling::*[not(local-name() = $param1) and generate-id(preceding-sibling::*[local-name() = $param1][last()]) = generate-id(current())]`. I'll edit my answer for better formatting. – LarsH Jan 11 '11 at 21:55
  • Oops, that `[last()]` should have been `[1]`. I always have trouble remembering what syntax preserves the reverse axis and what syntax reverts to the forward document order. – LarsH Jan 11 '11 at 22:04
  • Hmmm - the second option takes forever ! The first option could be a flier, but presumably generating the keys will take quite a while as well ? If I know that the 'b' element (ie. the one that will be at the top-level) is always the second node, can I use that in the key's match and use expressions somehow ? I have tried "D0011/*[not(self::*['/*/*[2]')]" (ugly, I know, but needs must !) but it doesn't work - perhaps because a node is returned, not a 'node type'. Of course this could be my inexperience showing through.... Thanks for your continued interest, – Nic Jan 12 '11 at 10:46
  • @Mike: "Generating the keys will take quite a while as well?" Not nearly as long as searching without keys. Like building an index, building a key should be much more efficient than searching w/o one. Yes, I think you can use your extra information about the group-starting element to good effect. However, in your sample input, `<b>` is not the second child of the outermost element, so I'm not positive that I understand you correctly. – LarsH Jan 12 '11 at 17:09
  • @Mike: Re 'I have tried `D0011/*[not(self::*['/*/*[2]')]`': The inner predicate there means "such that the opaque string '/*/*[2]' is a non-empty string" which is always true. What you want is something like `D0011/*[local-name(.) = local-name(/*/*[2])]`. I just edited my answer to show this in detail. – LarsH Jan 12 '11 at 17:22
  • @Mike: BTW if this has been useful to you, please don't forget to upvote and/or accept the answer. – LarsH Jan 12 '11 at 17:29
  • Hi Lars, Thanks so much for your help with this. I have experimented and found that having multiple keys is actually just as quick as having only one - it looks like the processor only calculates what it needs (obvious really) so assuming you only use one it makes no difference having multiple keys in the XSLT. Substituting the position of the required element into the key really slows it down, but I now know how to do it ! Thanks again. – Nic Jan 13 '11 at 09:46
1
<xsl:key name="k1" match="D0011/*[not(self::b)]" use="generate-id(preceding-sibling::b[1])"/>

<xsl:template match="D0011">
  <xsl:copy>
    <xsl:apply-templates select="b"/>
  </xsl:copy>
</xsl:template>

<xsl:template match="D0011/b">
  <xsl:copy>
    <xsl:copy-of select="key('k1', generate-id())"/>
  </xsl:copy>
</xsl:template>
Martin Honnen
  • Thanks for the suggestion - I'll give it a go, and probably come back with more stupid questions ! – Nic Jan 10 '11 at 16:57
  • If I know that the 'b' element (ie. the one that will be at the top-level) is always the second node, can I use that in the match and use expressions ? I have tried "D0011/*[not(self::*['/*/*[2]')]" (ugly, I know, but needs must !) but it doesn't work - perhaps because a node is returned, not a 'node type'. Cheers, – Nic Jan 11 '11 at 15:04
  • Thanks again for your help - you and Lars really put me on the right track. – Nic Jan 13 '11 at 09:48
0

This is the fine-grained traversal pattern:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="node()|@*" name="identity">
        <xsl:copy>
            <xsl:apply-templates select="node()[1]|@*"/>
        </xsl:copy>
        <xsl:apply-templates select="following-sibling::node()[1]"/>
    </xsl:template>
    <xsl:template match="b[1]" name="group">
        <xsl:copy>
            <xsl:apply-templates select="following-sibling::node()[1]"/>
        </xsl:copy>
        <xsl:apply-templates select="following-sibling::b[1]" mode="group"/>
    </xsl:template>
    <xsl:template match="b[position()!=1]"/>
    <xsl:template match="b" mode="group">
        <xsl:call-template name="group"/>
    </xsl:template>
</xsl:stylesheet>

Output:

<D0011>
    <b>
        <c></c>
        <d></d>
        <e></e>
    </b>
    <b>
    ....
    ....
    </b>
</D0011>