
After consulting many Stack Overflow answers, I have the following XSL, which converts a CSV into XML using the column headings as the node names for the corresponding cells.

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
    exclude-result-prefixes="xsl">
    <xsl:output method="xml" encoding="utf-8" />

    <xsl:variable name="newline" select="'&#10;'" />
    <xsl:variable name="comma" select="','" />
    <xsl:variable name="csv" select="." />
    <xsl:variable name="fields" select="substring-before( concat( $csv, $newline ), $newline )" />

    <xsl:template match="/">
        <xsl:element name="EXCHANGE">
            <xsl:element name="DDM">
                <xsl:call-template name="write-row">
                    <xsl:with-param name="rows" select="substring-after( $csv, $newline)"/>
                </xsl:call-template>
            </xsl:element>
        </xsl:element>
    </xsl:template>

    <xsl:template name="write-row">
        <xsl:param name="rows"/>

        <xsl:variable name="this-row" select="substring-before( concat( $rows, $newline ), $newline )" />
        <xsl:variable name="remaining-rows" select="substring-after( $rows, $newline )" />

        <xsl:if test="string-length($this-row) > 1">
            <xsl:element name="DDMSRS">
                <xsl:call-template name="write-item">
                    <xsl:with-param name="columns" select="$fields"/>
                    <xsl:with-param name="row" select="$this-row" />
                </xsl:call-template>
            </xsl:element>
        </xsl:if>

        <xsl:if test="string-length( $remaining-rows ) > 0">
            <xsl:call-template name="write-row">
                <xsl:with-param name="rows" select="$remaining-rows" />
            </xsl:call-template>
        </xsl:if>
    </xsl:template>


    <xsl:template name="write-item">
        <xsl:param name="row"/>
        <xsl:param name="columns"/>

        <xsl:variable name="col" select="substring-before( concat( $columns, $comma ), $comma)" />
        <xsl:variable name="remaining-items" select="substring-after( $row, $comma )" />
        <xsl:variable name="remaining-columns" select="substring-after( $columns, $comma )" />

        <xsl:if test="$col != ''">
            <xsl:element name="{$col}">
                <xsl:value-of select="substring-before( concat( $row, $comma ), $comma)" /> 
            </xsl:element>
        </xsl:if>

        <xsl:if test="string-length( $remaining-items ) > 0">
            <xsl:call-template name="write-item">
                <xsl:with-param name="columns" select="$remaining-columns"/>
                <xsl:with-param name="row" select="$remaining-items" />
            </xsl:call-template>
        </xsl:if>
    </xsl:template>

</xsl:stylesheet>

Running the XSL on a CSV like this (line breaks are the row separator):

<root><![CDATA[COL_HEAD1,COL_HEAD2,COL_HEAD3
123456789,Peter,My address
]]></root>

returns the following XML:

<?xml version="1.0" encoding="utf-8"?>
<EXCHANGE>
    <DDM>
        <DDMSRS>
            <COL_HEAD1>123456789</COL_HEAD1>
            <COL_HEAD2>Peter</COL_HEAD2>
            <COL_HEAD3>My address</COL_HEAD3>
        </DDMSRS>
    </DDM>
</EXCHANGE>

The issue I now have is that when I process a CSV with many rows (1000 or more), I run out of memory.

I have seen references to divide and conquer in other Stack Overflow questions, but I can't figure out how to split my string in half.

So my questions are:

  1. How do I perform divide and conquer in this scenario?
  2. Are there any other ways of improving the performance of this XSL using XSLT 1.0?
Peter
  • You need a well-formed XML document in order to run XSLT, so 'splitting' does not sound like the right thing to do. Are you restricted to using XSLT 1.0? You might try other technologies that are meant to work around the memory limitations of XSLT (like SAX). What XSLT engine do you use? Maybe there is a more performant one you can try. – Colin D Mar 05 '14 at 18:28
  • I am restricted to xslt 1.0. When I say splitting I am referring to dividing the csv in half and processing each bit separately and repeating the division process. The point is the XSL is working I just need to improve the performance with XSLT 1.0 native functionality. – Peter Mar 05 '14 at 19:02
  • The specific error I get is: (java.lang.OutOfMemoryError): Java heap space – Peter Mar 05 '14 at 19:08
  • @Peter If it hurts to use a certain tool, use a different tool. Instead of banging on this with XSLT 1.0, use a) a proper CSV parser as an intermediary step (preferred), b) a custom XSLT extension [written in Java that you can call from within the stylesheet](http://stackoverflow.com/questions/12761744/call-java-instance-methods-in-xslt) or c) an XSLT extension like [EXSLT](http://www.exslt.org/str/index.html) that can do the tokenizing for you. Implementing something like this in vanilla XSLT 1.0 is not useful. – Tomalak Mar 05 '14 at 21:07
  • Since you're using Java, why are you limiting yourself to XSLT 1.0? This would be so much easier in XSLT 2.0, which has been available in the Java world for about ten years. – Michael Kay Mar 05 '14 at 21:08
  • The only thing I can change is the XSL itself or the csv data (using a bash script). I cannot change the tool that processes the data. It seems from the responses I am getting there is no way to improve the XSL performance unless I change the format of my input. – Peter Mar 05 '14 at 21:31
  • XSLT has not been designed for string processing, there's not a whole lot you can do about that. Have you given EXSLT a try? Maybe your processor supports it. If you can use bash to influence the CSV data, I wonder why you don't do that. There are [plenty of ways to process CSV in the shell](http://stackoverflow.com/questions/1560393/bash-shell-scripting-csv-parsing). – Tomalak Mar 05 '14 at 21:50
  • @Tomalak I wanted to use the native system to do as much of the processing as possible. However Bash seems to be the only viable solution if I want to support large csvs. – Peter Mar 05 '14 at 23:04
  • It's the sane choice anyway. XSLT is a tree transformation language that can be amazingly expressive in its domain. It's not very suitable for text parsing or general purpose programming, and this is a situation where it shows. – Tomalak Mar 05 '14 at 23:12

1 Answer


Performing divide and conquer in this scenario with XSLT is not trivial. There are ways to optimize XSLT processing, and you might want to try a different processor that does things differently, but your code is not easy to improve, since it does practically no XML processing. It operates on a single big element node containing a string. You are essentially only using XPath functions to parse strings and XSLT variables to store them. It would be more efficient to get the XSLT overhead out of the way.
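If you still want to try divide and conquer in plain XSLT 1.0, here is a minimal sketch of one way to split the string in half. It is untested against your processor and I can't promise it removes the heap error. The template name `write-rows` is my own invention; everything else is taken from your stylesheet. The idea is to cut the row text at the first newline at or after the midpoint, then recurse on both halves, so the recursion depth should stay close to log2 of the row count when rows are of similar length.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" encoding="utf-8"/>

    <xsl:variable name="newline" select="'&#10;'"/>
    <xsl:variable name="comma" select="','"/>
    <xsl:variable name="csv" select="string(.)"/>
    <xsl:variable name="fields" select="substring-before(concat($csv, $newline), $newline)"/>

    <xsl:template match="/">
        <EXCHANGE>
            <DDM>
                <xsl:call-template name="write-rows">
                    <xsl:with-param name="rows" select="substring-after($csv, $newline)"/>
                </xsl:call-template>
            </DDM>
        </EXCHANGE>
    </xsl:template>

    <!-- Hypothetical replacement for write-row: halves the text instead of peeling off one row per call. -->
    <xsl:template name="write-rows">
        <xsl:param name="rows"/>
        <xsl:variable name="half" select="floor(string-length($rows) div 2)"/>
        <xsl:variable name="tail" select="substring($rows, $half + 1)"/>
        <xsl:choose>
            <!-- Base case: no newline left, so $rows holds at most one row. -->
            <xsl:when test="not(contains($rows, $newline))">
                <xsl:if test="string-length($rows) &gt; 1">
                    <DDMSRS>
                        <xsl:call-template name="write-item">
                            <xsl:with-param name="columns" select="$fields"/>
                            <xsl:with-param name="row" select="$rows"/>
                        </xsl:call-template>
                    </DDMSRS>
                </xsl:if>
            </xsl:when>
            <!-- Usual case: cut at the first newline at or after the midpoint. -->
            <xsl:when test="contains($tail, $newline)">
                <xsl:call-template name="write-rows">
                    <xsl:with-param name="rows" select="concat(substring($rows, 1, $half), substring-before($tail, $newline))"/>
                </xsl:call-template>
                <xsl:call-template name="write-rows">
                    <xsl:with-param name="rows" select="substring-after($tail, $newline)"/>
                </xsl:call-template>
            </xsl:when>
            <!-- Fallback: every newline lies before the midpoint, so cut at the first one. -->
            <xsl:otherwise>
                <xsl:call-template name="write-rows">
                    <xsl:with-param name="rows" select="substring-before($rows, $newline)"/>
                </xsl:call-template>
                <xsl:call-template name="write-rows">
                    <xsl:with-param name="rows" select="substring-after($rows, $newline)"/>
                </xsl:call-template>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:template>

    <!-- Same logic as write-item in the question: one element per column heading. -->
    <xsl:template name="write-item">
        <xsl:param name="row"/>
        <xsl:param name="columns"/>
        <xsl:variable name="col" select="substring-before(concat($columns, $comma), $comma)"/>
        <xsl:variable name="remaining-items" select="substring-after($row, $comma)"/>
        <xsl:if test="$col != ''">
            <xsl:element name="{$col}">
                <xsl:value-of select="substring-before(concat($row, $comma), $comma)"/>
            </xsl:element>
        </xsl:if>
        <xsl:if test="string-length($remaining-items) &gt; 0">
            <xsl:call-template name="write-item">
                <xsl:with-param name="columns" select="substring-after($columns, $comma)"/>
                <xsl:with-param name="row" select="$remaining-items"/>
            </xsl:call-template>
        </xsl:if>
    </xsl:template>
</xsl:stylesheet>
```

Both recursive calls always receive a strictly shorter string, so the recursion terminates, and rows are still written in their original order. The fallback branch degrades to one-row-at-a-time splitting when a single row is longer than everything before it, so a file with one very long final row would not benefit much.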

Your options include:

  1. Increase the JVM heap size (using the -Xmx option).
  2. Break your file into smaller parts (which need to be well formed XML) and process each part separately in XSLT.

But XSLT can't help you with option no. 2, since it needs to load the whole file into memory before it can start processing. It can't load part of the text and split it, since each fragment has to be well-formed XML. Even a SAX parser might not be much more efficient, since you only have one node. You are better off using an efficient string parser to split your CSV and then wrap each fragment in XML tags.
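As a rough illustration of what that buys you, suppose the pre-processing step wraps the heading line in a `header` element and every data line in a `row` element. Those element names are assumptions of mine, not anything your tool requires, and the pre-processing would also have to escape `&` and `<` in the data (or keep each line in its own CDATA section). The stylesheet can then iterate over the rows with `xsl:for-each`, and the only recursion left is the short per-column one:

```xml
<!-- Assumed input shape (hypothetical element names):
     <root>
       <header>COL_HEAD1,COL_HEAD2,COL_HEAD3</header>
       <row>123456789,Peter,My address</row>
     </root>
-->
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" encoding="utf-8"/>

    <xsl:variable name="comma" select="','"/>

    <xsl:template match="/root">
        <EXCHANGE>
            <DDM>
                <!-- One iteration per row element: no recursion over rows. -->
                <xsl:for-each select="row">
                    <DDMSRS>
                        <xsl:call-template name="write-item">
                            <xsl:with-param name="columns" select="string(/root/header)"/>
                            <xsl:with-param name="row" select="string(.)"/>
                        </xsl:call-template>
                    </DDMSRS>
                </xsl:for-each>
            </DDM>
        </EXCHANGE>
    </xsl:template>

    <!-- Same logic as write-item in the question: one element per column heading. -->
    <xsl:template name="write-item">
        <xsl:param name="row"/>
        <xsl:param name="columns"/>
        <xsl:variable name="col" select="substring-before(concat($columns, $comma), $comma)"/>
        <xsl:variable name="remaining-items" select="substring-after($row, $comma)"/>
        <xsl:if test="$col != ''">
            <xsl:element name="{$col}">
                <xsl:value-of select="substring-before(concat($row, $comma), $comma)"/>
            </xsl:element>
        </xsl:if>
        <xsl:if test="string-length($remaining-items) &gt; 0">
            <xsl:call-template name="write-item">
                <xsl:with-param name="columns" select="substring-after($columns, $comma)"/>
                <xsl:with-param name="row" select="$remaining-items"/>
            </xsl:call-template>
        </xsl:if>
    </xsl:template>
</xsl:stylesheet>
```

The whole document is still loaded into memory, but each row is now a small text node of its own, so the processor no longer has to keep ever-shrinking copies of one huge string on the call stack.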

helderdarocha
  • Thanks for the response, helder. As I have noted above, option 2 seems to be the only viable solution. – Peter Mar 05 '14 at 21:33