
I'm transforming a > 2GB file with a lookup template in the XSLT. I would like this to run faster but can't find any low hanging fruit to improve performance. Any help would be greatly appreciated. I'm a newb when it comes to transformations.

This is the current format of the XML file.

<?xml version="1.0" encoding="utf-8" ?>
<contacts>
    <contact>
        <attribute>
            <name>text12</name>
            <value>B00085590</value>
        </attribute>
        <attribute>
            <name>text34</name>
            <value>Atomos</value>
        </attribute>
        <attribute>
            <name>date866</name>
            <value>02/21/1991</value>
        </attribute>
    </contact>
    <contact>
        <attribute>
            <name>text12</name>
            <value>B00058478</value>
        </attribute>
        <attribute>
            <name>text34</name>
            <value>Balderas</value>
        </attribute>
        <attribute>
            <name>date866</name>
            <value>11/24/1997</value>
        </attribute>
    </contact>
</contacts>

The xslt I used for the transformation.

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:msxsl="urn:schemas-microsoft-com:xslt" exclude-result-prefixes="msxsl"
>
    <xsl:output method="xml" indent="yes"/>


    <!--Identify location of the lookup xml-->
    <xsl:param name="lookupDoc" select="document('C:\Projects\Attributes.xml')" />  

    <!--Main Template-->
    <xsl:template match="/contacts">        

            <!--Apply Formatted Contacts Template-->
            <xsl:apply-templates select="contact" />            

    </xsl:template>

    <!--Formatted Contacts Template-->
    <xsl:template match="contact">
        <contact>
            <xsl:for-each select="attribute">
                <!--Create variable to hold New Name after passing the Data Name to the Lookup Template-->
                <xsl:variable name="newName">
                    <xsl:apply-templates select="$lookupDoc/attributes/attribute">
                        <xsl:with-param name="nameToMatch" select="name" />
                    </xsl:apply-templates>
                </xsl:variable>     
                <!--Format Contact Element with New Name variable-->
                <xsl:element name="{$newName}">
                    <xsl:value-of select="value"/>
                </xsl:element>          
            </xsl:for-each>
        </contact>
    </xsl:template>

    <!--Lookup Template-->
    <xsl:template match="attributes/attribute">
        <xsl:param name="nameToMatch" />            
            <xsl:value-of select='translate(translate(self::node()[name = $nameToMatch]/mappingname, "()*%$#@!~&lt;&gt;&apos;&amp;,.?[]=-+/\:1234567890", "")," ","")' />
        </xsl:template>


</xsl:stylesheet>

Sample Lookup XML

<?xml version="1.0" encoding="utf-8" ?>
<attributes>
    <attribute>
        <name>text12</name>
        <mappingname>ID</mappingname>
        <datatype>Varchar2</datatype>
        <size>30</size>
    </attribute>
    <attribute>
        <name>text34</name>
        <mappingname>Last Name</mappingname>
        <datatype>Varchar2</datatype>
        <size>30</size>
    </attribute>
    <attribute>
        <name>date866</name>
        <mappingname>DOB</mappingname>
        <datatype>Date</datatype>
        <size></size>
    </attribute>
</attributes>

Transformed XML

<?xml version="1.0" encoding="utf-8" ?>
<contacts>
    <contact>
        <ID>B00085590</ID>
        <LastName>Atomos</LastName>
        <DOB>02/21/1991</DOB>
    </contact>
    <contact>
        <ID>B00058478</ID>
        <LastName>Balderas</LastName>
        <DOB>11/24/1997</DOB>
    </contact>
</contacts>

C#

XsltSettings settings = new XsltSettings(true, true);
XslCompiledTransform ContactsXslt = new XslCompiledTransform();
ContactsXslt.Load(@"C:\Projects\ContactFormat.xslt", settings, new XmlUrlResolver());

using (XmlReader r = XmlReader.Create(@"C:\Projects\Contacts.xml")){
   using (XmlWriter w = XmlWriter.Create(@"C:\Projects\FormattedContacts.xml")) {
      w.WriteStartElement("contacts");
      while (r.Read()) {                        
         if (r.NodeType == XmlNodeType.Element && r.Name == "contact") {
            XmlReader temp = new XmlTextReader(new StringReader(r.ReadOuterXml()));                                
            ContactsXslt.Transform(temp, null, w);                            
         }
      }                        
   }
}

The approach I'm taking is transforming 1 node at a time to avoid an OutOfMemoryException. Should I be feeding larger chunks through to speed up the process? Or am I going about this all wrong?

aybrady
  • I wonder whether `XmlReader temp = new XmlTextReader(new StringReader(r.ReadOuterXml()));` is necessary. Can't you just pass the `XmlReader r` you have positioned on a `contact` element directly as the first argument to the `Transform` method? Or does `XslCompiledTransform` then close the `XmlReader` on you? Even in that case I think `XmlReader temp = r.ReadSubtree()` is the preferred and intended use of the API. – Martin Honnen Aug 08 '18 at 08:27

3 Answers


I think you can simplify the XSLT code

       <xsl:for-each select="attribute">
            <!--Create variable to hold New Name after passing the Data Name to the Lookup Template-->
            <xsl:variable name="newName">
                <xsl:apply-templates select="$lookupDoc/attributes/attribute">
                    <xsl:with-param name="nameToMatch" select="name" />
                </xsl:apply-templates>
            </xsl:variable> 

using the template

   <xsl:template match="attributes/attribute">
    <xsl:param name="nameToMatch" />            
        <xsl:value-of select='translate(translate(self::node()[name = $nameToMatch]/mappingname, "()*%$#@!~&lt;&gt;&apos;&amp;,.?[]=-+/\:1234567890", "")," ","")' />
    </xsl:template>

to

       <xsl:for-each select="attribute">
            <!--Create variable to hold New Name after passing the Data Name to the Lookup Template-->
            <xsl:variable name="newName">
                <xsl:apply-templates select="$lookupDoc/attributes/attribute[name = current()/name]"/>
            </xsl:variable> 

with the template being simplified to

   <xsl:template match="attributes/attribute">
        <xsl:value-of select='translate(translate(mappingname, "()*%$#@!~&lt;&gt;&apos;&amp;,.?[]=-+/\:1234567890", "")," ","")' />
    </xsl:template>

I think that is certainly a more concise and more idiomatic XSLT way of expressing the approach; whether it improves performance is something you would have to test.

In general, to improve the performance of cross-references/lookups in XSLT, it is recommended to use a key, so you would declare

<xsl:key name="att-lookup" match="attributes/attribute" use="name"/>

and then use it as

            <xsl:variable name="name" select="name"/>
            <xsl:variable name="newName">
                <!-- in XSLT 1 we need to change the context doc for the key lookup -->
                <xsl:for-each select="$lookupDoc">
                   <xsl:apply-templates select="key('att-lookup', $name)"/>
                </xsl:for-each>
            </xsl:variable> 

I think that would considerably speed up the lookup within a single transformation. However, since you combine XmlReader and XSLT and run the stylesheet once for every element your XmlReader finds, I can't tell whether it helps a lot; you would need to try.
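
Putting the key together with the stylesheet from the question, a complete XSLT 1.0 version might look like the following sketch. It keeps the lookup path, the template structure and the `translate` calls from the question unchanged and only swaps the parameterized lookup for the key; I have not measured it against the real data.

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" indent="yes"/>

    <!-- Lookup document, same path as in the question -->
    <xsl:param name="lookupDoc" select="document('C:\Projects\Attributes.xml')"/>

    <!-- Index the lookup entries by the value of their name child -->
    <xsl:key name="att-lookup" match="attributes/attribute" use="name"/>

    <!-- Main template, as in the question -->
    <xsl:template match="/contacts">
        <xsl:apply-templates select="contact"/>
    </xsl:template>

    <xsl:template match="contact">
        <contact>
            <xsl:for-each select="attribute">
                <!-- Remember the name before the context node changes -->
                <xsl:variable name="name" select="name"/>
                <xsl:variable name="newName">
                    <!-- In XSLT 1.0 key() searches the document of the context node,
                         so switch the context to the lookup document first -->
                    <xsl:for-each select="$lookupDoc">
                        <xsl:apply-templates select="key('att-lookup', $name)"/>
                    </xsl:for-each>
                </xsl:variable>
                <xsl:element name="{$newName}">
                    <xsl:value-of select="value"/>
                </xsl:element>
            </xsl:for-each>
        </contact>
    </xsl:template>

    <!-- Lookup template: emit the mapping name stripped of characters that are not wanted in an element name -->
    <xsl:template match="attributes/attribute">
        <xsl:value-of select='translate(translate(mappingname, "()*%$#@!~&lt;&gt;&apos;&amp;,.?[]=-+/\:1234567890", "")," ","")'/>
    </xsl:template>
</xsl:stylesheet>

Note that if a `name` has no entry in the lookup file, `$newName` is empty and `xsl:element` fails, which is the same behaviour as the original stylesheet.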

As pointed out in the XSLT 3 suggestion, I would also consider transforming the lookup file first, and only once, to avoid repeating all those translate calls that create proper XML element names. Either do that outside of the existing XSLT, or do it inside by building the cleaned lookup in a variable and then using exsl:node-set to convert the result tree fragment into a node-set. But as you run the XSLT repeatedly, I think in your case it is probably better to transform the lookup document outside of the main XSLT first, so that the translates are not done again and again.
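
As a rough sketch of that preprocessing step, a small XSLT 1.0 stylesheet like the one below could be run once over the lookup file. It copies everything unchanged except `mappingname`, which is replaced by its cleaned-up form; the idea of saving the result under a separate file name (say `CleanAttributes.xml`, a hypothetical name) and pointing `document()` at it is my assumption, not something from the question.

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" indent="yes"/>

    <!-- Identity template: copy every node and attribute unchanged by default -->
    <xsl:template match="@* | node()">
        <xsl:copy>
            <xsl:apply-templates select="@* | node()"/>
        </xsl:copy>
    </xsl:template>

    <!-- Replace each mappingname by the element name derived from it, computed once here -->
    <xsl:template match="attribute/mappingname">
        <mappingname>
            <xsl:value-of select='translate(translate(., "()*%$#@!~&lt;&gt;&apos;&amp;,.?[]=-+/\:1234567890", "")," ","")'/>
        </mappingname>
    </xsl:template>
</xsl:stylesheet>

With the lookup file prepared this way, the lookup template in the main stylesheet would no longer need the nested translate calls and could shrink to `<xsl:value-of select="mappingname"/>`.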

Martin Honnen
  • This worked great. The performance gained by using the key made the transformation time go from hours to under 5 minutes. I'll have to explore XSLT 3.0 in the future. Thanks for your help! – aybrady Aug 09 '18 at 02:59

When reading huge XML files, always use XmlReader. I like using a combination of XmlReader and LINQ to XML, together with dictionaries. See the code below:

using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;

namespace ConsoleApplication1
{
    class Program
    {
        const string FILENAME = @"c:\temp\test.xml";
        static void Main(string[] args)
        {

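            // Stream the file so that only one <contact> element is materialized at a time
            using (XmlReader reader = XmlReader.Create(FILENAME))
            {
                while (!reader.EOF)
                {
                    // Advance to the next <contact> unless the reader is already on one
                    if (reader.Name != "contact")
                    {
                        reader.ReadToFollowing("contact");
                    }
                    if (!reader.EOF)
                    {
                        // ReadFrom consumes the whole element and leaves the reader just past it
                        XElement xContact = (XElement)XElement.ReadFrom(reader);
                        Contact newContact = new Contact();
                        Contact.contacts.Add(newContact);

                        // Map attribute name to value; if a name repeats, the first value wins
                        newContact.attributes = xContact.Descendants("attribute")
                            .GroupBy(x => (string)x.Element("name"), y => (string)y.Element("value"))
                            .ToDictionary(x => x.Key, y => y.FirstOrDefault());
                    }
                }
            }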
        }
    }
    public class Contact
    {
        public static List<Contact> contacts = new List<Contact>();

        public Dictionary<string, string> attributes { get; set; }
    }
}
jdweng

As an alternative, you might want to look into solving the task with XSLT 3 and its streaming feature (https://www.w3.org/TR/xslt-30/#streaming-concepts). Streaming lets you process the huge input file in a forwards-only but still declarative way. The only place that needs special care is the template for the attribute element, where you work on an intentionally created full copy of that element so that XPath navigation to its child elements is allowed. Additionally, I think it makes sense to read in the lookup document only once and to do the translate calls that create the proper element names only once. So the following is a streaming XSLT 3 solution, runnable with Saxon 9.8 EE, which transforms the lookup document into an XPath 3.1 map (https://www.w3.org/TR/xpath-31/#id-maps) and otherwise uses a streamable mode to process the large main input:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:map="http://www.w3.org/2005/xpath-functions/map"
    exclude-result-prefixes="xs map"
    version="3.0">

    <!-- could of course load the document using select="document('lookup.xml')" instead of inlining it as done here just for the example and testing -->
    <xsl:param name="lookup-doc">
        <attributes>
            <attribute>
                <name>text12</name>
                <mappingname>ID</mappingname>
                <datatype>Varchar2</datatype>
                <size>30</size>
            </attribute>
            <attribute>
                <name>text34</name>
                <mappingname>Last Name</mappingname>
                <datatype>Varchar2</datatype>
                <size>30</size>
            </attribute>
            <attribute>
                <name>date866</name>
                <mappingname>DOB</mappingname>
                <datatype>Date</datatype>
                <size></size>
            </attribute>
        </attributes>      
    </xsl:param>

    <xsl:variable 
        name="lookup-map"
        as="map(xs:string, xs:string)"
        select="map:merge(
        $lookup-doc/attributes/attribute 
        ! 
        map { 
        string(name) : translate(translate(mappingname, '()*%$#@!~&lt;&gt;''&amp;,.?[]=-+/\:1234567890', ''), ' ','')
        }
        )"/>

    <xsl:mode on-no-match="shallow-copy" streamable="yes"/>

    <xsl:output method="xml" indent="yes"/>

    <xsl:template match="contact/attribute">
        <xsl:variable name="attribute-copy" select="copy-of()"/>
        <xsl:element name="{$lookup-map($attribute-copy/name)}">
            <xsl:value-of select="$attribute-copy/value"/>
        </xsl:element>
    </xsl:template>

</xsl:stylesheet>

An online sample is at https://xsltfiddle.liberty-development.net/bFDb2Ct/1. It runs there with Saxon 9.8 HE, which ignores the streaming and does normal XSLT processing.

To run streaming XSLT 3 with Saxon 9.8 from C#, use Xslt30Transformer (http://saxonica.com/html/documentation/dotnetdoc/Saxon/Api/Xslt30Transformer.html) and call ApplyTemplates on an input Stream with your huge input XML (http://saxonica.com/html/documentation/dotnetdoc/Saxon/Api/Xslt30Transformer.html#ApplyTemplates(System.IO.Stream,Saxon.Api.XmlDestination)).

Martin Honnen