0

I am trying to find a way to sort a List as fast as possible. I know about "bucket sort" which I will use(I think?). But before doing that. I think I am looking for the fastest possible clean algorithm first, - and then use that in bucket sort?

The strings looks like below where I add 100.000 elements in a loop:

-0,awb/aje - ddfas/asa - asoo/qwa
-1,awb/aje - ddfas/asa - asoo/qwa
-2,awb/aje - ddfas/asa - asoo/qwa

So I want to sort in a descending order for the first comma delimited argument which is a double like: -0, -1, -2 etc..

I have tried 3 Approaches where only the Approach 1 actually sorts correctly. Approach 2 and 3 do not sort entirely correctly in a descending order numeric wise.

So the approach 1 which sorts correctly do this in 30 seconds. The thing is that I will have about 3 Million elements and not only 100.000 like in this example which then should take at least 900 seconds or more.

My question is how can we sort 100.000 or more correctly 3 million elements as fast as possible? Running: sortingtestBENCHMARKS() will show the results

    public void sortingtestBENCHMARKS()
    {
        List<String> dataLIST = new List<String>(); List<String> sortedLIST = new List<String>(); String resultString = ""; int index = 0;
        DateTime starttime = DateTime.Now; DateTime endtime = DateTime.Now; TimeSpan span = new TimeSpan();
        for (double i = 0; i < 100000; i+=1)
        {
            dataLIST.Add("-" + i + "," + "awb/aje" + " - " + "ddfas/asa" + " - " + "asoo/qwa");
        }
        dataLIST = shuffle(dataLIST);

        /*--------------------------------------------------------------------------*/
        //APPROACH 1: 30 seconds (Sorts correctly in descending order)
        starttime = DateTime.Now;
        dataLIST = sortLIST(dataLIST);
        endtime = DateTime.Now;
        span = endtime - starttime;
        resultString = "Approach 1: " + span.TotalSeconds;
        dataLIST = shuffle(dataLIST);

        /*--------------------------------------------------------------------------*/
        //APPROACH 2: 55 seconds (Sorts INcorrectly in descending order)
        starttime = DateTime.Now;
        for (int i = 0; i < dataLIST.Count; i++) 
        {
            index = sortedLIST.BinarySearch(dataLIST[i]);
            if (index < 0)
            {
                sortedLIST.Insert(~index, dataLIST[i]);
            }
        }
        endtime = DateTime.Now;
        span = endtime - starttime;
        resultString = resultString + "\nApproach 2: " + span.TotalSeconds;

        /*--------------------------------------------------------------------------*/
        //APPROACH 3: 2 seconds (Sorts INcorrectly in descending order)
        starttime = DateTime.Now;
        dataLIST.Sort(); //1.6 seconds
        endtime = DateTime.Now;
        span = endtime - starttime;
        resultString = resultString + "\nApproach 3: " + span.TotalSeconds;

        /*--------------------------------------------------------------------------*/

        MessageBox.Show("Elapsed Times:\n\n" + resultString);
    }
    List<String> sortLIST(List<String> theLIST)
    {
        System.Threading.Thread.CurrentThread.CurrentCulture = new System.Globalization.CultureInfo("en-US");
        theLIST.Sort(new Comparison<String>((a, b) =>
        {
            int result = 0;
            double ad = 0;
            double bd = 0;
            NumberFormatInfo provider = new NumberFormatInfo();
            provider.NumberGroupSeparator = ",";
            provider.NumberDecimalSeparator = ".";
            provider.NumberGroupSizes = new int[] { 5 };
            ad = Convert.ToDouble(a.Replace("a", "").Replace("c", "").Split(',')[0], provider);
            bd = Convert.ToDouble(b.Replace("a", "").Replace("c", "").Split(',')[0], provider);
            if (ad < bd)
            {
                result = 1;
            }
            else if (ad > bd)
            {
                result = -1;
            }
            return result;
        }));
        return theLIST;
    }
    List<String> shuffle(List<String> list)
    {
        var randomizedList = new List<String>();
        var rnd = new Random();
        while (list.Count != 0)
        {
            var index = rnd.Next(0, list.Count);
            randomizedList.Add(list[index]);
            list.RemoveAt(index);
        }
        return randomizedList;
    }
Bakuriu
  • 98,325
  • 22
  • 197
  • 231
Andreas
  • 1,121
  • 4
  • 17
  • 34
  • What are you trying to achieve? Can't you use Arrays.sort()/Collections.sort()/TreeSet and an appropriate Comparator (such as Comparator::reverseOrder) ? – NoDataFound Feb 20 '19 at 20:33
  • Yes Approach 1 in my code do use a sorting algorithm which uses a Comparator. But I am looking for if there is an even smarter/better solution as it takes 30 seconds. Using only the standard Collections.Sort() method takes only 2 seconds BUT it does not sort in a descending order entirely correct as we know. – Andreas Feb 20 '19 at 20:40
  • You should **always** include the tag for the language you are using! For once without a language the page will not highlight the code at all. – Bakuriu Feb 20 '19 at 20:42
  • (Yes you are right I added c# as tag for the question) – Andreas Feb 20 '19 at 20:44
  • You are comparing different things actually, that 'comparer' you wrote is doing lots of inefficent stuff. Also if you convert everything to double when sorting, why isnt the list double in the first place? – CSharpie Feb 20 '19 at 20:50
  • I see, the thing is that the string holds information for that first parameter in the string which is a double. So the information in the string has to follow along in the sort. I am not sure if it is possible to do in any other approach using 2 lists where one is only double but when sorting the double list, how will the string information keep track? – Andreas Feb 20 '19 at 20:59
  • [Natural Sort Order in C#](https://stackoverflow.com/q/248603/7444103). – Jimi Feb 20 '19 at 21:06
  • You should use the `Stopwatch` class for measuring time. – Rufus L Feb 20 '19 at 21:50
  • Here I was thinking about Java. Your comparator is indeed doing too many thing, and probably several times per object (each time the algorithm needs to compare a versus b); I would transform the String into something else prior to sorting and the `NumberFormatInfo` could probably be reused instead of creating each time you compare - but I don't know the specifics in C#. You could probably use a profiler to to determine where is the bottleneck. – NoDataFound Feb 20 '19 at 21:54
  • You also should make a copy of the original shuffled list for each test, since the order of shuffling may impact the performance of your sorting (just to be comparing apples to apples). – Rufus L Feb 20 '19 at 22:03

3 Answers3

1

It seems to me that you can just split your string on the , character, strip the - off the first item in the split array, and then use OrderBy on that result:

var sorted = dataLIST.OrderBy(i => double.Parse(i.Split(',')[0].TrimStart('-'))).ToList();

I made a copy of your code and then used the one working method you have and compared it to running an OrderBy on the split string method mentioned above. The OrderBy/Split method is a little more than 30 times faster.

public static void sortingtestBENCHMARKS()
{
    var dataLIST = new List<string>();

    // Create the list
    for (var i = 0; i < 100000; i ++)
    {
        dataLIST.Add("-" + i + "," + "awb/aje" + " - " + "ddfas/asa" + " - " + "asoo/qwa");
    }

    // Shuffle the list
    dataLIST = shuffle(dataLIST);

    // Make two copies of the same shuffled list
    var copy1 = dataLIST.ToList();
    var copy2 = dataLIST.ToList();

    // Use a stopwatch for measuring time when benchmark testing
    var stopwatch = new Stopwatch();

    /*--------------------------------------------------------------------------*/
    //APPROACH 1: 2.83 seconds (Sorts correctly in descending order)
    stopwatch.Start();
    copy2 = sortLIST(copy2);
    stopwatch.Stop();
    Console.WriteLine($"sortLIST method: {stopwatch.Elapsed.TotalSeconds} seconds");

    /*--------------------------------------------------------------------------*/
    //APPROACH 2: 0.09 seconds (Sorts correctly in descending order)
    stopwatch.Restart();
    copy1 = copy1.OrderBy(i => double.Parse(i.Split(',')[0].TrimStart('-'))).ToList();
    stopwatch.Stop();
    Console.WriteLine($"OrderBy method:  {stopwatch.Elapsed.TotalSeconds} seconds");

    // Ensure that the lists are sorted identically
    Console.WriteLine($"Lists are the same: {copy1.SequenceEqual(copy2)}");
}

Output

![enter image description here

Rufus L
  • 36,127
  • 5
  • 30
  • 43
  • WOW!!! This is completely amazing. Revolutionary :) I tested out the code just now and could produce the same results. So OrderBy/Split is so much faster. I can't thank you enough for this method Rufus! I will play around with this code now :) – Andreas Feb 20 '19 at 22:26
0
var sortedList = theLIST.OrderByDescending(s=>s);
gandalf
  • 451
  • 5
  • 18
0

You should optimize the inner loop as hard as you can, because most of the processing time is spent there. I suggest implementing the string tokenizer yourself, since you only need the first token and the string is very uniform. The second optimization you might have is multiply all the number with -1 so sorting the list in reverse order is straightforward. Something like this:

private static double getNumberFromString(String s){
    int posFirstComma=0;
    for (; posFirstComma<s.length() && s.charAt(posFirstComma)!=','; posFirstComma++);
    return Convert.toDouble(s.subString(0, posFirstComma)*(-1);
}
myData.sort(new Comparision<String>((a,b)=> getNumberFromString(a)-getNumberFromString(b));

I personally won't touch the sort algorithm in the library itself because it is already thoroughly optimized. Just optimize everything in the for loop.

Gnut
  • 541
  • 1
  • 5
  • 19
  • Shouldn´t this line be the same?: return Convert.ToDouble(s.Substring(0, s.IndexOf(','))) * -1; However the use code does not compile where getNumberFromString is underlined?: dataLIST.Sort(new Comparison((a, b) => getNumberFromString(a) - getNumberFromString(b))); – Andreas Feb 20 '19 at 21:38
  • Multiply with (-1) should not work as I will have both positive and negative doubles – Andreas Feb 20 '19 at 21:39
  • It does not compile because Comparator returns an int, whereas I'm returning the double. It is a small matter, because you can always compare the doubles with an if-clause and return 1 and -1 like in your code (I think you need to consider when they are equal too, in case your numbers are not unique). I just give you kind of pseudocode to demonstrate the idea. And yes s.indexOf() works too, sr forgot about that )). If they are not all negative then just sort by reverse order with some function in the lib. The slowness lies in your comparator there. It is unlikely cause of library sort code. – Gnut Feb 20 '19 at 21:50