1

I am kind of stuck again as I am unable to understand this.

So I have a class named CSVItem:

public class CSVItem
{
    public int SortedAccountNumber { get; set; }
    public DateTime Date { get; set; }
    public int SNO { get; set; }
    public string AccountNumber { get; set; }
    public double Value { get; set; }
    public int Year
    {
        get
        {
            if (Date.Month > MainWindow.fiscalMonth)
            {
                return Date.Year+1;
            }
            return Date.Year;
        }
    }
    public int StaticCounter { get { return 1; } }

    public CSVItem(string accNo, DateTime date, double value, int sNo)
    {
        Value = value;            
        Date = date;
        AccountNumber = accNo;
        SNO = sNo;
    }


}

I read a CSV, and I make a List of Type CSV Item with about 500k items. Then I try to sort using the default Order By method of the list, and try to return the list from the sorted collection. Here is the code:

List<CSVItem> items = new List<CSVItem>();

 // ---- some code to read csv and load into items collection

 List<CSVItem> vItems = items.OrderBy(r1 => r1.AccountNumber).ThenBy(r1 => r1.Date).ToList();

It is like taking forever to sort and then convert the collection back to a list. Well I have certainly tried loading about a million records previously and never had such -no response- from Linq Sorting ever and it is kind of driving me crazy. Any help or suggestion on where I can look for finding the problem?

BartoszKP
  • 34,786
  • 15
  • 102
  • 130
Vikas
  • 795
  • 1
  • 4
  • 20
  • I don't know how much this will help, but you may want to cache the result of Year instead of doing the calculation each time. – skaz Jul 10 '15 at 10:25
  • @skaz But seems he did not use Year to sort.... – HarryQuake Jul 10 '15 at 10:27
  • @PaulZahra yes, you are correct. It is actually ToList() which is making it slower. If I don't do ToList() it is totally instant and wicked fast. But I need a list coz I need to loop through the records too for some operations. I am unable to find what can be the reason behind this slowness. If you guys can suggest me, I may be able to find a workaround or make it faster. – Vikas Jul 10 '15 at 10:37
  • @Skaz, Year is not being used at all for sorting so I dont know if it makes a difference at all. – Vikas Jul 10 '15 at 10:38
  • How long do you mean by "forever"? – theMayer Jul 10 '15 at 10:44
  • The ToList() is actually evaluating the query so slowdown is obvious tbh... Try using IEnumerable vItems = items.OrderBy(r1 => r1.AccountNumber).ThenBy(r1 => r1.Date) .. you can still iterate vItems... or try IOrderedEnumerable vItems = items.OrderBy(r1 => r1.AccountNumber).ThenBy(r1 => r1.Date) – Paul Zahra Jul 10 '15 at 10:44
  • Also, you may consider the use of `SortedSet` – theMayer Jul 10 '15 at 10:46
  • @Vikas using IOrderedEnumerable vItems = items.OrderBy(r1 => r1.AccountNumber).ThenBy(r1 => r1.Date) you wont have to call ToList() when ordering. – Paul Zahra Jul 10 '15 at 10:48
  • You can use [Sort](https://msdn.microsoft.com/es-es/library/w56d4y5z%28v=vs.100%29.aspx) instead of LINQ OrderBy – Nacho Jul 10 '15 at 10:50
  • using AsParallel worked. I have no idea why. But it did :) – Vikas Jul 10 '15 at 10:53
  • @Nacho Several tests already showed, Sort is slower than the OrderBy. – greenhoorn Jul 10 '15 at 11:07
  • Also, isn't it (i mean icomparable sorting) even slower when we have multiple columns to sort with. – Vikas Jul 10 '15 at 11:51

1 Answers1

2

You can use AsParallel() to your advantage.

List<CSVItem> vItems = items.AsParallel().OrderBy(r1 => r1.AccountNumber).ThenBy(r1 => r1.Date).ToList();

The question arised, if the parallelization of OrderBy() does have side-effects if it's followed by a ThenBy().

When does the AsParallel() split the IEnumerable? There are 2 possible answers. Let's take the given query:

items.AsParallel().OrderBy(x=>x.Age).ThenBy(x=>x.Size)

Option 1

The items get split, each part gets ordered by age, then by size and finally merge back into 1 list. Obviously not what we want.

Option 2

The items get split, each part gets ordered by age, the items merge back into 1 list. After that, the items get split again, ordered by size and merge back into 1 list. That's what we want.

I created a little example to check, which one is true.

using System;
using System.Collections.Generic;
using System.Linq;

public class Program
{
    static void Main(string[] args)
    {
        List<TestItem> items = new List<TestItem>();
        List<TestItem> itemsNonParallel = new List<TestItem>();

        items.Add(new TestItem() { Age = 1, Size = 12 });
        items.Add(new TestItem() { Age = 2, Size = 1 });
        items.Add(new TestItem() { Age = 5, Size = 155 });
        items.Add(new TestItem() { Age = 23, Size = 42 });
        items.Add(new TestItem() { Age = 7, Size = 32 });
        items.Add(new TestItem() { Age = 9, Size = 22 });
        items.Add(new TestItem() { Age = 34, Size = 11 });
        items.Add(new TestItem() { Age = 56, Size = 142 });
        items.Add(new TestItem() { Age = 300, Size = 13 });

        itemsNonParallel.Add(new TestItem() { Age = 1, Size = 12 });
        itemsNonParallel.Add(new TestItem() { Age = 2, Size = 1 });
        itemsNonParallel.Add(new TestItem() { Age = 5, Size = 155 });
        itemsNonParallel.Add(new TestItem() { Age = 23, Size = 42 });
        itemsNonParallel.Add(new TestItem() { Age = 7, Size = 32 });
        itemsNonParallel.Add(new TestItem() { Age = 9, Size = 22 });
        itemsNonParallel.Add(new TestItem() { Age = 34, Size = 11 });
        itemsNonParallel.Add(new TestItem() { Age = 56, Size = 142 });
        itemsNonParallel.Add(new TestItem() { Age = 300, Size = 13 });

        foreach (var item in items.AsParallel().OrderBy(x => x.Age).ThenBy(x => x.Size))
        {
            Console.WriteLine($"Age: {item.Age}     Size: {item.Size}");
        }

        Console.WriteLine("---------------------------");

        foreach (var item in itemsNonParallel.OrderBy(x => x.Age).ThenBy(x => x.Size))
        {
            Console.WriteLine($"Age: {item.Age}     Size: {item.Size}");
        }

        Console.ReadLine();        
    }
}

public class TestItem
{
    public int Age { get; set; }
    public int Size { get; set; }
}

Result

AsParallel() does what we want. It first processes the OrderBy() parallel, merges back the list and then moves on to the next query, in our case ThenBy(). I tested this multiple times and always the same result.

BartoszKP
  • 34,786
  • 15
  • 102
  • 130
greenhoorn
  • 1,601
  • 2
  • 15
  • 39
  • Thanks @greenhoorn. It worked. Can you explain the concept behind it? It is so fast... – Vikas Jul 10 '15 at 10:52
  • @Vikas I don't know how it works exactly, but the method call AsParallel() splits your original list into several parts. Each part is being processed by an own thread. On dual-core cpus the performance is doubled. But note, this only works for time-taking lists. For small lists AsParallel() is about 200 times slower! Because of the additional method call. – greenhoorn Jul 10 '15 at 11:06
  • 1
    way to go buddy, thanks for the valuable information. – Vikas Jul 10 '15 at 11:51
  • Btw this is no longer simple linq... it's now PLINQ .. See http://www.dotnetperls.com/asparallel .. also on single core cpus the performance will be roughly the same as without AsParallel()... on dual core it's doubled... on quad core + it may be improved further. – Paul Zahra Jul 10 '15 at 11:58
  • 1
    @PaulZahra Single-core cpus can't perform parallel tasks (normally). That's why there is no performance enhancement. I don't know a single person which is using less than 2 cores ;-) – greenhoorn Jul 10 '15 at 12:00
  • 1
    @greenhoorn indeed, I just put the info here for completeness. – Paul Zahra Jul 10 '15 at 12:02
  • @Vikas If I were you I'd check the resulting order to ensure it is as you want... run that check multiple times... the following implies that your orderby and thenby may be run in different orders... "The query partitions the source into tasks that are executed asynchronously on multiple threads. The order in which each task completes depends not only on the amount of work involved to process the elements in the partition, but also on external factors such as how the operating system schedules each thread." See https://msdn.microsoft.com/en-us/library/dd460714%28v=vs.110%29.aspx – Paul Zahra Jul 10 '15 at 12:03
  • @Vikas The above quote implies to me that for example AsParellel().Where would work fine... but AsParallel().OrderBy().ThenBy() may give different results depending on several factors. – Paul Zahra Jul 10 '15 at 12:06
  • @PaulZahra That's something I was thinking of too. Maybe silly, but I was assuming AsParallel() is processing only the method itself (i.e. OrderBy()) parallel and then moves on to the next method and does the same. Otherwise it would not be very useful per design. But could you please check this Vikas? – greenhoorn Jul 10 '15 at 12:12
  • @greenhoorn Indeed, it depends wholly on how it is splitting... e.g. if it simply splits the 'source' meaning the 500k records and orders them on seperate threads then merges them it's fine... but if it splits the 'source' meaning the orderby() and the thenby() issues could arise. – Paul Zahra Jul 10 '15 at 12:15
  • @greenhoorn see if anyone can answer the question outright... http://stackoverflow.com/questions/31340798/how-does-asparallel-split-its-source – Paul Zahra Jul 10 '15 at 12:29
  • Ok, so just to update. I had a CSV with about 200k rows, and I prepared my results using AsParallel and without AsParallel. Did lot of sum, max, min and sorting to prepare my results. And in my test, the results came same. – Vikas Jul 13 '15 at 10:38