
I have a List of data combined from an Entity Framework database query and another IEnumerable of the same type holding in-memory data from other sources. For some of our clients this list amounts to about 200,000 entries (about half from the database), which makes the grouping operation take extremely long (up to 30 minutes on our cheap virtual Windows server).

The grouping operation reduces the list to about 10,000 objects (roughly 20:1).

The data class of the List is basically just a big row of Strings and Ints and a few other basic types:

public class ExportData
{
  public string FirstProperty;
  public string StringProperty;
  public string String1;
  ...
  public string String27;
  public int Int1;
  ...
  public int Int15;
  public decimal Mass;
  ...
}

The grouping is done through a custom IEqualityComparer that basically amounts to this:

  1. If two items are allowed to be grouped by the custom logic, that means about half of the properties of both objects are equal; those are the only properties we care about from this point on, besides Id, Mass and a special StringProperty, which can still differ even when the items are allowed to be grouped.
  2. Each new grouped object should have the relevant properties (the ones that were equal in step 1), plus the combined Ids of the grouped items as a string and the sum of all the Mass (decimal) properties of the grouped items; the special StringProperty should be set depending on whether a special string occurs in any of the grouped items.

List<ExportData> exportData; // in memory list of combined data from database + memory data

exportData = exportData.GroupBy(w => w, new ExportComparer(data)).Select(g =>
{
  ExportData group = g.Key;
  group.Mass = g.Sum(s => s.Mass);

  if (g.Count() > 1)
  {
    group.CombinedIds = string.Join("-", g.Select(a => a.Id.ToString()));
  }

  if (g.Any(s => s.StringProperty.Equals("AB"))) 
  {
    group.StringProperty = "AB";
  }
  else if (g.Any(s => s.StringProperty.Equals("CD"))) 
  {
    group.StringProperty = "CD";
  }
  else
  {
    group.StringProperty = "EF";
  }

  return group;
}).ToList();

And the custom comparer for completeness:

public class ExportComparer : IEqualityComparer<ExportData>
{
  private CompareData data;

  public ExportComparer()
  {
  }
  public ExportComparer(CompareData comparedata)
  {
    // Additional data needed for comparison logic
    // prefetched from another database
    data = comparedata;
  }
  public bool Equals(ExportData x, ExportData y)
  {
    if (ReferenceEquals(x, y)) return true;

    if (ReferenceEquals(x, null) || ReferenceEquals(y, null)) return false;

    (...) // Rest of the unit-tested and already optimized very long comparison logic
    return equality; // result from the custom comparison
  }

  public int GetHashCode(ExportData obj)
  {
    if (ReferenceEquals(obj, null)) return 0;

    int hash = 17;

    hash = hash * 23 + obj.FirstProperty.GetHashCode();
    (...) // repeated for each property used in the comparison logic
    return hash;
  }
}

What can I do to make this GroupBy run faster?

Aether McLoud
  • Do you have enough RAM for 200K entries to not use HDD for temporary space? If your comparison logic is already optimized, then there is little you can do except improve hardware. – nvoigt Apr 17 '18 at 10:22
  • Why are you making a new `ExportComparer` for each element? Especially considering the `Equals` method doesn't depend on `CompareData` – Felix Castor Apr 17 '18 at 10:26
  • Why not just use a stored procedure and let SQL Server do the optimisation for grouping the data, then lazy-load the results? That way your application server won't be hammered, and SQL Server will do what it is designed to do, i.e. work on such heavy-lifting tasks. – AKrishan Apr 17 '18 at 10:32
  • So to clarify, grouping of 200k items _in memory_ takes 30 minutes? – Evk Apr 17 '18 at 10:33
  • @nvoigt RAM usage on the (rented, virtual) server never goes above 80%, but I'm not sure if all of that is actually real RAM hardware. I was hoping the .Select into the new objects could be optimized. – Aether McLoud Apr 17 '18 at 10:45
  • @FelixCastor Mostly because this was my first custom comparer and that's what I saw in examples, but I just changed it to use a single pre-created instance and it doesn't speed things up noticeably. – Aether McLoud Apr 17 '18 at 10:47
  • Does `Distinct(new ExportComparer())` instead of `GroupBy` perform equally bad? – Vidmantas Blazevicius Apr 17 '18 at 10:49
  • @AKrishan because not all data in the list comes from the database, and even if I temporarily stored the additional data in the db, sadly I'm the only dev here and I wouldn't know where to start with a stored procedure since the combination logic is really complicated and queries a lot of other tables and views. – Aether McLoud Apr 17 '18 at 10:50
  • 1
    Similar to what @VidmantasBlazevicius suggested, try eliminating the `Select` cost by removing it and check the time for doing `var grouped = exportData.GroupBy(w => w, new ExportCompare(data)).ToList();`. If it is slow, then the problem is in your comparer. Which I suspect is the case, because such times with memory structures usually indicate quadratic time complexity algorithms. – Ivan Stoev Apr 18 '18 at 07:29
  • @AetherMcLoud was Select the reason after all, or custom comparer? – Evk Apr 20 '18 at 14:39
  • @Evk the custom comparer was a huge culprit, got it down to 1-2 minutes with changes there, though the select still shaved off some further 5-10 seconds so I marked that post as an answer. – Aether McLoud Apr 20 '18 at 16:18

2 Answers


It's hard to suggest optimizations for the comparer, since its code is not shown, but there is an optimization for the Select clause.

Right now you are using Sum, Count, Select and Any (twice) inside that Select. That means the elements in each group are enumerated five times (at least three of those being full enumerations). Instead you can use a single foreach loop and evaluate the conditions yourself:

exportData.GroupBy(w => w, new ExportComparer(data)).Select(g =>
{                
    ExportData group = g.Key;
    decimal mass = 0m;
    var ids = new List<int>();
    bool anyAb = false;
    bool anyCd = false;
    // only one loop
    foreach (var item in g) {
        mass += item.Mass;
        ids.Add(item.Id);
        anyAb = anyAb || item.StringProperty.Equals("AB");
        anyCd = anyCd || item.StringProperty.Equals("CD");
    }
    group.Mass = mass;
    if (ids.Count > 1) {
        group.CombinedIds = string.Join("-", ids);
    }
    if (anyAb)
        group.StringProperty = "AB";
    else if (anyCd)
        group.StringProperty = "CD";
    else
        group.StringProperty = "EF";

    return group;
}).ToList();

Now we loop over the grouping just once, which should be more efficient than doing it five times.
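
If you want to check how much of the total time the Select actually accounts for, compared to the GroupBy with the custom comparer (as suggested in the comments), a rough timing sketch could look like the following. Stopwatch is from System.Diagnostics, and the variable names are only illustrative:

var sw = System.Diagnostics.Stopwatch.StartNew();
var grouped = exportData.GroupBy(w => w, new ExportComparer(data)).ToList();
Console.WriteLine($"GroupBy: {sw.Elapsed}"); // time spent in the comparer-driven grouping

sw.Restart();
var result = grouped.Select(g => g.Key).ToList(); // substitute the full single-pass projection here
Console.WriteLine($"Select: {sw.Elapsed}");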

Evk

g.Count() > 1 could be optimized to g.Any() because you don't really care for the count, you only care if there is at least one element.

Your Any/Any calls for AB or CD could be handled in one loop instead of two.
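
A minimal sketch of what that single pass might look like (using the property names from the question; this is only an illustration, not code from the answer):

bool anyAb = false, anyCd = false;
foreach (var item in g)
{
    if (item.StringProperty.Equals("AB")) { anyAb = true; break; } // "AB" wins, no need to keep scanning
    if (item.StringProperty.Equals("CD")) anyCd = true;            // remember "CD", keep looking for "AB"
}
group.StringProperty = anyAb ? "AB" : anyCd ? "CD" : "EF";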

You might want to try to make a List or an Array from your g, but that's more of a guess: depending on what happens internally and how your groups are built, that might do good or harm. You will need to test and profile.

However, I strongly suspect something else is fishy here: either your RAM is exhausted or there is code you haven't shown that needs optimizing. 30 minutes for some in-memory work is insane.

nvoigt
  • I care whether there are 2 or more items in that count line, so I can't really use Any there. The custom comparer does a few lookups itself on in-memory Lists in the CompareData object given to it in the constructor. Besides that it's just lots of boolean logic in about 20 different cases depending on the input properties. I can try to update the question with the relevant list lookups in the Equals function. – Aether McLoud Apr 17 '18 at 11:01