5

I have a problem right now that is the result of current limitations on a server our team does not control.

We have a job that should have be done by database but we're forced to use a XML file and parse it using Javascript/jQuery. We don't even have write access for our scripts (only through our FTP account)... we don't like to talk about it but that's what we got.

The problem, as a result of those limitations, is that we need to parse a large XML file that's around 500kb, with 1700-ish records of document name/number/url.

The number is pretty complex, such as "31-2b-1029E", mixed with stuff like "T2315342".

So, I have figured that I need to use something called "Natural Sort" (thank you stackoverflow).

Anyways I tried using this script here:

/*
 * Reference: http://www.overset.com/2008/09/01/javascript-natural-sort-algorithm/
 * Natural Sort algorithm for Javascript - Version 0.6 - Released under MIT license
 * Author: Jim Palmer (based on chunking idea from Dave Koelle)
 * Contributors: Mike Grier (mgrier.com), Clint Priest, Kyle Adams, guillermo
 */
function naturalSort (a, b) {
    var re = /(^-?[0-9]+(\.?[0-9]*)[df]?e?[0-9]?$|^0x[0-9a-f]+$|[0-9]+)/gi,
        sre = /(^[ ]*|[ ]*$)/g,
        dre = /(^([\w ]+,?[\w ]+)?[\w ]+,?[\w ]+\d+:\d+(:\d+)?[\w ]?|^\d{1,4}[\/\-]\d{1,4}[\/\-]\d{1,4}|^\w+, \w+ \d+, \d{4})/,
        hre = /^0x[0-9a-f]+$/i,
        ore = /^0/,
        // convert all to strings and trim()
        x = a.toString().replace(sre, '') || '',
        y = b.toString().replace(sre, '') || '',
        // chunk/tokenize
        xN = x.replace(re, '\0$1\0').replace(/\0$/,'').replace(/^\0/,'').split('\0'),
        yN = y.replace(re, '\0$1\0').replace(/\0$/,'').replace(/^\0/,'').split('\0'),
        // numeric, hex or date detection
        xD = parseInt(x.match(hre)) || (xN.length != 1 && x.match(dre) && Date.parse(x)),
        yD = parseInt(y.match(hre)) || xD && y.match(dre) && Date.parse(y) || null;
    // first try and sort Hex codes or Dates
    if (yD)
        if ( xD < yD ) return -1;
        else if ( xD > yD ) return 1;
    // natural sorting through split numeric strings and default strings
    for(var cLoc=0, numS=Math.max(xN.length, yN.length); cLoc < numS; cLoc++) {
        // find floats not starting with '0', string or 0 if not defined (Clint Priest)
        oFxNcL = !(xN[cLoc] || '').match(ore) && parseFloat(xN[cLoc]) || xN[cLoc] || 0;
        oFyNcL = !(yN[cLoc] || '').match(ore) && parseFloat(yN[cLoc]) || yN[cLoc] || 0;
        // handle numeric vs string comparison - number < string - (Kyle Adams)
        if (isNaN(oFxNcL) !== isNaN(oFyNcL)) return (isNaN(oFxNcL)) ? 1 : -1; 
        // rely on string comparison if different types - i.e. '02' < 2 != '02' < '2'
        else if (typeof oFxNcL !== typeof oFyNcL) {
            oFxNcL += ''; 
            oFyNcL += ''; 
        }
        if (oFxNcL < oFyNcL) return -1;
        if (oFxNcL > oFyNcL) return 1;
    }
    return 0;
}

And applied using:

// Natural Sort (disabled because it is super freaking slow.... need xsl transform sorting instead)
var sortedSet = $(data).children("documents").children("document").sort(function(a, b) {
    return naturalSort($(a).children('index').text(), $(b).children('index').text());
});

This works fine on our other, smaller XML file, but for the giant 500kb-ish file Safari (v4) just plainly hangs for up to a few minutes to sort this through, while Firefox (latest) takes around 10 second to process (still not good, but at least sane).

I also found this other smaller/lighter script called Alphanum:

function alphanum(a, b) {
  function chunkify(t) {
    var tz = [], x = 0, y = -1, n = 0, i, j;

    while (i = (j = t.charAt(x++)).charCodeAt(0)) {
      var m = (i == 46 || (i >=48 && i <= 57));
      if (m !== n) {
        tz[++y] = "";
        n = m;
      }
      tz[y] += j;
    }
    return tz;
  }

  var aa = chunkify(a);
  var bb = chunkify(b);

  for (x = 0; aa[x] && bb[x]; x++) {
    if (aa[x] !== bb[x]) {
      var c = Number(aa[x]), d = Number(bb[x]);
      if (c == aa[x] && d == bb[x]) {
        return c - d;
      } else return (aa[x] > bb[x]) ? 1 : -1;
    }
  }
  return aa.length - bb.length;
}

This runs faster for Safari, but is still locks up the browser for a minute or so.

I did some research, and it seems that a few people recommended using XSL to sort the XML entries, which apparently is much faster due to it's being built into the browser instead of running on top of JavaScript.

There's apparently several different implementations, with Sarissa getting getting mentioned several times, the sourceforge page seems to indicate that the last update occured back in 2011-06-22.

There's also other choices such as xslt.js

My question is:

  1. Is XSL the best sorting option for this particular problem?
  2. If so how can I use XSL to do Natural Sort? (url to resources?)
  3. If yes to both questions, which library should I use for the best compatibility and speed?
  4. If XSL is not the best choice, then which one is?

Thanks for looking at my problem.

Shaw
  • 151
  • 1
  • 8
  • possible duplicate of [natural sort of text and numbers, JavaScript](http://stackoverflow.com/questions/2802341/natural-sort-of-text-and-numbers-javascript) – Kevin Peno Dec 01 '11 at 22:55
  • From all this long question I didn't understand at all what this sort should accomplish. Please, edit your question and provide: 1. A small but complete example of an XML file both before and after the sorting. 2. Explanation of the required rules/properties of the sorting operation. – Dimitre Novatchev Dec 02 '11 at 02:32

3 Answers3

5

Good question, +1.

Here is an XSLT 1.0 solution (there is an XSLT 2.0 solution that is much simpler and easier to write and is probably more efficient, however none of the 5 major browsers comes with an XSLT 2.0 processor):

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:ext="http://exslt.org/common" exclude-result-prefixes="xml">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:variable name="vDigits" select="'0123456789'"/>

 <xsl:variable name="vPadding" select=
 "'                    '"/>

 <xsl:variable name="vMaxNumLength"
      select="string-length($vPadding)"/>

 <xsl:template match="/">
  <xsl:variable name="vrtfPass1">
   <t>
    <xsl:apply-templates/>
   </t>
  </xsl:variable>

  <xsl:variable name="vPass1" select="ext:node-set($vrtfPass1)"/>

  <t>
    <xsl:for-each select="$vPass1/*/*">
     <xsl:sort select="@sortMe"/>

     <xsl:copy>
      <xsl:value-of select="."/>
     </xsl:copy>
    </xsl:for-each>
  </t>
 </xsl:template>

 <xsl:template match="str">
   <str>
    <xsl:apply-templates select="text()" mode="normalize"/>
    <xsl:copy-of select="text()"/>
   </str>
 </xsl:template>

 <xsl:template match="text()" mode="normalize" name="normalize">
  <xsl:param name="pText" select="."/>
  <xsl:param name="pAccum" select="''"/>

  <xsl:choose>
   <xsl:when test="not(string-length($pText) >0)">
     <xsl:attribute name="sortMe">
       <xsl:value-of select="$pAccum"/>
     </xsl:attribute>
   </xsl:when>
   <xsl:otherwise>
    <xsl:variable name="vChar1" select="substring($pText,1,1)"/>

    <xsl:choose>
     <xsl:when test="not(contains($vDigits,$vChar1))">
       <xsl:variable name="vDig1" select=
       "substring(translate($pText,
                            translate($pText, $vDigits, ''),
                            ''
                            ),
                  1,1)"/>
       <xsl:variable name="vDig">
        <xsl:choose>
         <xsl:when test="string-length($vDig1)">
          <xsl:value-of select="$vDig1"/>
         </xsl:when>
         <xsl:otherwise>0</xsl:otherwise>
        </xsl:choose>
       </xsl:variable>

       <xsl:variable name="vNewText" select=
        "substring-before(concat($pText,$vDig), $vDig)"/>

       <xsl:call-template name="normalize">
        <xsl:with-param name="pText" select=
         "substring($pText, string-length($vNewText)+1)"/>
        <xsl:with-param name="pAccum" select=
        "concat($pAccum, $vNewText)"/>
       </xsl:call-template>
     </xsl:when>

     <xsl:otherwise>
      <xsl:variable name="vNonDig1" select=
      "substring(translate($pText, $vDigits, ''),1,1)"/>

      <xsl:variable name="vNonDig">
        <xsl:choose>
         <xsl:when test="string-length($vNonDig1)">
          <xsl:value-of select="$vNonDig1"/>
         </xsl:when>
         <xsl:otherwise>Z</xsl:otherwise>
        </xsl:choose>
      </xsl:variable>

      <xsl:variable name="vNum" select=
           "substring-before(concat($pText,'Z'),$vNonDig)"/>

      <xsl:variable name="vNumLength" select=
       "string-length($vNum)"/>

      <xsl:variable name="vNewText" select=
       "concat(substring($vPadding,
                         1,
                         $vMaxNumLength -$vNumLength),
               $vNum
               )"/>

       <xsl:call-template name="normalize">
        <xsl:with-param name="pText" select=
         "substring($pText, $vNumLength +1)"/>
        <xsl:with-param name="pAccum" select=
        "concat($pAccum, $vNewText)"/>
       </xsl:call-template>
     </xsl:otherwise>
    </xsl:choose>
   </xsl:otherwise>
  </xsl:choose>
 </xsl:template>
</xsl:stylesheet>

when this transformation is applied on the XML document below:

<t>
 <str>Allegia 6R Clasteron</str>
 <str>200X Radonius</str>
 <str>Xiph Xlater 10000</str>
 <str>1000X Radonius Maximus</str>
 <str>Callisto Morphamax 6000 SE</str>
 <str>10X Radonius</str>
 <str>20X Radonius</str>
 <str>30X Radonius</str>
 <str>20X Radonius Prime</str>
 <str>40X Radonius</str>
 <str>Allegia 50 Clasteron</str>
 <str>Allegia 500 Clasteron</str>
 <str>Allegia 50B Clasteron</str>
 <str>Allegia 51 Clasteron</str>
 <str>Alpha 100</str>
 <str>Alpha 2</str>
 <str>Alpha 200</str>
 <str>Alpha 2A</str>
 <str>Alpha 2A-8000</str>
 <str>Alpha 2A-900</str>
 <str>Callisto Morphamax</str>
 <str>Callisto Morphamax 500</str>
 <str>Callisto Morphamax 5000</str>
 <str>Callisto Morphamax 600</str>
 <str>Callisto Morphamax 6000 SE2</str>
 <str>Callisto Morphamax 700</str>
 <str>Callisto Morphamax 7000</str>
 <str>Xiph Xlater 2000</str>
 <str>Xiph Xlater 300</str>
 <str>Xiph Xlater 40</str>
 <str>Xiph Xlater 5</str>
 <str>Xiph Xlater 50</str>
 <str>Xiph Xlater 500</str>
 <str>Xiph Xlater 5000</str>
 <str>Xiph Xlater 58</str>
</t>

the wanted, correctly "natural-sorted" result is produced:

<t>
   <str>10X Radonius</str>
   <str>20X Radonius</str>
   <str>20X Radonius Prime</str>
   <str>30X Radonius</str>
   <str>40X Radonius</str>
   <str>200X Radonius</str>
   <str>1000X Radonius Maximus</str>
   <str>Allegia 6R Clasteron</str>
   <str>Allegia 50 Clasteron</str>
   <str>Allegia 50B Clasteron</str>
   <str>Allegia 51 Clasteron</str>
   <str>Allegia 500 Clasteron</str>
   <str>Alpha 2</str>
   <str>Alpha 2A</str>
   <str>Alpha 2A-900</str>
   <str>Alpha 2A-8000</str>
   <str>Alpha 100</str>
   <str>Alpha 200</str>
   <str>Callisto Morphamax</str>
   <str>Callisto Morphamax 500</str>
   <str>Callisto Morphamax 600</str>
   <str>Callisto Morphamax 700</str>
   <str>Callisto Morphamax 5000</str>
   <str>Callisto Morphamax 6000 SE</str>
   <str>Callisto Morphamax 6000 SE2</str>
   <str>Callisto Morphamax 7000</str>
   <str>Xiph Xlater 5</str>
   <str>Xiph Xlater 40</str>
   <str>Xiph Xlater 50</str>
   <str>Xiph Xlater 58</str>
   <str>Xiph Xlater 300</str>
   <str>Xiph Xlater 500</str>
   <str>Xiph Xlater 2000</str>
   <str>Xiph Xlater 5000</str>
   <str>Xiph Xlater 10000</str>
</t>

Important assumption: This solution supposes that no number would have more than 40 digits. While this would be true in the prevailing number of practical cases, should there arise a case when this limit is insufficient, it is easy to modify this solution to accept the limit-value as an external/global parameter.

Finally, Performance:

Processing an XML document similar to the above, but having 1700 str elements takes 0.659 sec. on my 8-year old Pentium single core 3GHz CPU, 2GB RAM computer.

Explanation:

  1. This is a two-pass solution.

  2. In the first pass all nodes are copied "as-is" with the exception that a sortMe attribute is added to every str element. This attribute contains the string value of the only text-node child of str -- in which any number is left-padded with spaces to a total fixed length of 40.

  3. In Pass 2 we are sorting all str elements alphabetically using a single sort key -- the sortMe attribute.

Now, to answer all the 4 original questions:

My question is:

Is XSL the best sorting option for this particular problem?
If so how can I use XSL to do Natural Sort? (url to resources?)
If yes to both questions, which library should I use for the best compatibility and speed?
If XSL is not the best choice, then which one is?

Answers:

  1. Any implementation of an optimal sort algorithm (regardless of the language) should suffice. In this regards XSLT is a good choice.

  2. The code above provides a complete and exact XSLT implementation of "natural" sort.

  3. No library is necessary -- just use the above code as is. If you need assistance how to invoke a transformation from your PL, consult the appropriate documentation.

  4. Any PL, XSLT included, with an implementation of an optimal sorting algorithm is a suitable choice.

Dimitre Novatchev
  • 240,661
  • 26
  • 293
  • 431
  • Thank you for taking this much time answering my question, I will tinker around with your example and adapt it to fit my XML structure - and I'll do more research about XSLT in my spare time. – Shaw Dec 02 '11 at 14:08
  • Old post, but I want to add: One reason NOT to use xslt, is that it does not support parameters in the 'xsl:sort select=expression' field. There are 'dynamic' solutions but those get complicated and don't seem to be consistent across browsers. I do see that this solution uses 2 passes to get around the fact it does not support a variable. Going to give that a shot. – ajk Dec 07 '21 at 19:37
  • @ajk, This is not true: ``` ``` – Dimitre Novatchev Dec 07 '21 at 21:49
  • Correctly pasted: ` ` Apply it on: `` – Dimitre Novatchev Dec 07 '21 at 22:06
  • @Dimitre, I agree your solution works. I am just disappointed in xslt 1.0. It shouldn't require 2 passes to implement a sort. – ajk Dec 08 '21 at 14:24
  • 1
    @ajk, The 2-passes are needed not for the sorting, but because in XSLT 1.0 the output collected into a variable is just an RTF (Result Tree Fragment) and not a regular tree, so this is converted into a regular tree/fragment by using an extension function . As pointed by others, Saxon.js is available in the browser and it implements XSLT 3.0. Since XSLT 2.0 the RTF type has been eliminated, so just use Saxon.js – Dimitre Novatchev Dec 08 '21 at 16:47
  • 1
    @ajk, See for example this, which gives anyone the ability to execute XSLT 3.0 transformations just in their browser. It uses Saxon.js: https://martin-honnen.github.io/xslt3fiddlePWA/ – Dimitre Novatchev Dec 09 '21 at 16:40
1

A couple of answers to ancillary questions:

(a) Sarissa is not an XSLT processor, it is a Javascript wrapper layer that provides a common Javascript API to the XSLT processor provided as part of the browser.

(b) xslt.js is a dead project that attempted to implement an XSLT processor in Javascript. Forget it, it's history.

A more recent effort in this direction is Saxon-CE, which is currently in alpha release (this is written in Java and cross-compiled to Javascript using GWT). When finished, this will give you XSLT 2.0 in the browser. Server-side Saxon has a collation that gives you 'natural sorting' (<xsl:sort collation='http://saxon.sf.net/collation?alphanumeric=yes'/>) but this is not available in the current version of Saxon-CE.

(P.S. I hadn't come across the name 'natural sorting' before. Thank you.)

Michael Kay
  • 156,231
  • 11
  • 92
  • 164
  • Thanks for the helpful insight on Sarissa and XSLT.js, I'll look into using Sarissa as a wrapper since we ARE targeting more than one browser (though we thankfully don't have to worry about IE6), and it is helpful to be able to access native XSLT API in a consistent way. – Shaw Dec 02 '11 at 14:11
  • @Shaw of all the browsers, IE has supported XSL processing in browser since IE5(ish). It was only recently that Webkit and (finally) Gecko started to support it. Sarissa is the jQuery of XSL processor libraries. It works around all of the quirks of processing XSL within javascript. – Kevin Peno Dec 02 '11 at 16:09
0

The sort function is called more times than there are elements in the array being sorted, many more times actually. For your 1700 elements to sort, the comparison func probably gets called between 10,000 to 750,000 times depending on browser...Since your sort comparison function is slow, you stand to benefit a lot by doing the heavy lifting once per element and storing the result, and then sorting on the stored results.

I bet the main problem is you use jquery in your sort function. That's gotta be expensive. The actual natural sort comparison is probably relatively quick. I don't know your xml structure, but if you can ditch the jquery in the sort function, try to copy element references to a new array, which is linear time. Then you sort the array. Then, loop over the now sorted array and use the element references to set the order in the xml doc.

goat
  • 31,486
  • 7
  • 73
  • 96
  • @_chris: Seems you are describing my solution :) – Dimitre Novatchev Dec 02 '11 at 06:21
  • You are of course correct in that my JavaScript sort can use ALOT of optimization, however before I commited doing that I found out about XSL(T) and it seems that would be an even faster - since it's running on a compiled (browser-internal) parser, and a more appropriate method - since it is specifically designed for XML. – Shaw Dec 02 '11 at 14:16