8

I've got a data set like this:

GroupName   GroupValue   MemberName   MemberValue
'Group1'    1            'Member1'    1
'Group1'    1            'Member2'    2
'Group2'    2            'Member3'    3
'Group2'    2            'Member4'    2
'Group3'    2            'Member5'    4
'Group3'    2            'Member6'    1

What I want to select is the rows that have the maximum MemberValue per GroupName, but only for those GroupNames that have the largest GroupValue, and pass them into a delegate function. Like this:

'Group2'    2            'Member3'    3
'Group3'    2            'Member5'    4

So far I've tried this format...

data.Where(maxGroupValue => 
    maxGroupValue.GroupValue == data.Max(groupValue => groupValue.GroupValue))
.Select(FunctionThatTakesData)

...but that just gives me every member of Group2 and Group3. I've tried putting a GroupBy() before the Select(), but that turns the output into an IGrouping<string, DataType> so FunctionThatTakesData() doesn't know what to do with it, and I can't do another Where() to filter out only the maximum MemberValues.

What can I do to get this data set properly filtered and passed into my function?

Theodor Zoulias
JAF

3 Answers

13

You can do that with the following LINQ query.

var results = data.GroupBy(r => r.GroupValue)
    .OrderByDescending(g => g.Key)
    .FirstOrDefault()
    ?.GroupBy(r => r.GroupName)
    .Select(g => g.OrderByDescending(r => r.MemberValue).First());

First, group on GroupValue, order the groups in descending order by Key (which is the GroupValue), and take the first one. That group contains all the rows with the maximum GroupValue. Then group those rows on GroupName and, within each group, order by MemberValue in descending order and take the first row, which is the row with the maximum MemberValue for that GroupName. The C# 6 null-conditional operator ?. after FirstOrDefault handles the case where data is empty; results is then null. If you're not using C# 6, you'll need to handle that case up front, and then you can just use First instead.

With the addition of MaxBy in .NET 6, this can now be done as follows:

var results = data.GroupBy(r => r.GroupValue)
    .MaxBy(g => g.Key)
    ?.GroupBy(r => r.GroupName)
    .Select(g => g.MaxBy(r => r.MemberValue));
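For reference, here is a minimal runnable sketch of the first query applied to the sample data from the question. The Row record is a hypothetical stand-in for the asker's DataType, and the trailing ToList() with a ?? fallback is an addition so that an empty input yields an empty list rather than null.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sample rows from the question. Row is a hypothetical type, not the asker's DataType.
var data = new List<Row>
{
    new Row("Group1", 1, "Member1", 1),
    new Row("Group1", 1, "Member2", 2),
    new Row("Group2", 2, "Member3", 3),
    new Row("Group2", 2, "Member4", 2),
    new Row("Group3", 2, "Member5", 4),
    new Row("Group3", 2, "Member6", 1),
};

// Keep only the rows that share the largest GroupValue, then pick the
// row with the largest MemberValue inside each GroupName.
var results = data.GroupBy(r => r.GroupValue)
    .OrderByDescending(g => g.Key)
    .FirstOrDefault()
    ?.GroupBy(r => r.GroupName)
    .Select(g => g.OrderByDescending(r => r.MemberValue).First())
    .ToList() ?? new List<Row>();

foreach (var row in results)
    Console.WriteLine($"{row.GroupName} {row.GroupValue} {row.MemberName} {row.MemberValue}");
// Prints:
// Group2 2 Member3 3
// Group3 2 Member5 4

record Row(string GroupName, int GroupValue, string MemberName, int MemberValue);
```

From here, results.Select(FunctionThatTakesData) passes each remaining row to the delegate, assuming it accepts a single row.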
juharr
  • Thank you! Now, suppose each grouping can have multiple members with the same value... should I just use a `Where()` in place of the `First()`? – JAF May 05 '17 at 13:24
  • In that case you'd want `.SelectMany(g => g.GroupBy(r => r.MemberValue).OrderByDescending(sg => sg.Key).First())` to replace the last line. – juharr May 05 '17 at 15:16
  • Genius! Thank you! – JAF May 05 '17 at 17:24
3

So basically, what you want is to divide your data elements into groups with the same value for GroupName. From every group you want to take one element, namely the one with the largest value for the property MemberValue.

Whenever you have a sequence of items and you want to divide it into groups based on the value of one or more properties of the items, you use Enumerable.GroupBy.

GroupBy takes your sequence as input, plus one extra parameter: a function that selects the property (or properties) of each item that decides which group the item ends up in.

In your case, you want to divide your sequence into groups where all elements in a group have the same GroupName.

var groups = mySequence.GroupBy(element => element.GroupName);

For every element in mySequence, it reads the property GroupName and puts the element into the group of elements that share this value of GroupName.

Using your example data, you'll have three groups:

  • The group with all elements with GroupName == "Group1". The first two elements of your sequence will be in this group
  • The group with all elements with GroupName == "Group2". The third and fourth element of your sequence will be in this group
  • The group with all elements with GroupName == "Group3". The last two elements of your sequence will be in this group

Each group has a property Key, containing your selection value. This key identifies the group and is guaranteed to be unique within your collection of groups. So you'll have a group with Key == "Group1", a group with Key == "Group2", etc.

Besides the Key, every group is a sequence of the elements in the group (note: the group IS an enumerable sequence; it does not HAVE an enumerable sequence).

Your second step would be to take from every group the element in the group with the largest value for MemberValue. For this you would order the elements in the group by descending value for property MemberValue and take the first one.

var myResult = mySequence.GroupBy(element => element.GroupName)
    // intermediate result: groups where all elements have the same GroupName
    .Select(group => group.OrderByDescending(groupElement => groupElement.MemberValue)
        // intermediate result: the group's elements ordered by descending MemberValue
        .First());

Result: from every group, ordered by descending MemberValue, take the first element, which is the one with the largest MemberValue.

Ordering the complete group is not very efficient if you only want the element with the largest value for MemberValue. The answer for this can be found here on StackOverflow
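One possible way to avoid the sort, sketched here as an assumption rather than as the approach the linked answer takes, is to fold over each group once with Enumerable.Aggregate and keep the larger element. The Row record and the sample values are hypothetical stand-ins based on the question's data.

```csharp
using System;
using System.Linq;

// Sample rows from the question. Row is a hypothetical type for illustration.
var mySequence = new[]
{
    new Row("Group1", 1, "Member1", 1),
    new Row("Group1", 1, "Member2", 2),
    new Row("Group2", 2, "Member3", 3),
    new Row("Group2", 2, "Member4", 2),
    new Row("Group3", 2, "Member5", 4),
    new Row("Group3", 2, "Member6", 1),
};

// Walk each group once and keep whichever element has the larger MemberValue.
// On a tie the earlier element is kept. Groups produced by GroupBy are never
// empty, so the seedless Aggregate overload does not throw here.
var maxPerGroup = mySequence
    .GroupBy(element => element.GroupName)
    .Select(group => group.Aggregate(
        (best, next) => next.MemberValue > best.MemberValue ? next : best))
    .ToList();

foreach (var row in maxPerGroup)
    Console.WriteLine($"{row.GroupName} {row.MemberName} {row.MemberValue}");
// Prints:
// Group1 Member2 2
// Group2 Member3 3
// Group3 Member5 4

record Row(string GroupName, int GroupValue, string MemberName, int MemberValue);
```

This is a single pass per group instead of a full sort of each group.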

Harald Coppoolse
  • That gives you each `GroupName` with the max `MemberValue`. The OP wants only the rows with the max `GroupValue` and then each `MemberName` with the max `MemberValue`. – juharr May 05 '17 at 13:00
2

An easy way to solve this problem is to use the new (.NET 6) MaxBy LINQ operator, along with the GroupBy and Select operators:

IEnumerable<Record> query = records
    .GroupBy(x => x.GroupName)
    .Select(g => g.MaxBy(x => x.MemberValue));

This is an easy but not memory-efficient solution. The reason is that GroupBy builds a full-blown Lookup<TKey, TSource> structure under the hood, a dictionary-like container that holds all the records associated with each key. This structure is built before the elements in each grouping are compared to select the maximum element.

In most cases this inefficiency is not a problem, because the records are not that many and they are already stored in memory. But if you have a truly deferred enumerable sequence that contains a humongous number of elements, you might run out of memory. In this case you could use the GroupMaxBy operator below. This operator stores in memory only the current maximum element per key:

/// <summary>
/// Groups the elements of a sequence according to a specified key selector
/// function, and then returns the maximum element in each group according to
/// a specified value selector function.
/// </summary>
public static IEnumerable<TSource> GroupMaxBy<TSource, TKey, TValue>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TValue> valueSelector,
    IEqualityComparer<TKey> keyComparer = default,
    IComparer<TValue> valueComparer = default)
{
    // Arguments validation omitted
    valueComparer ??= Comparer<TValue>.Default;
    var dictionary = new Dictionary<TKey, (TSource Item, TValue Value)>(keyComparer);
    foreach (var item in source)
    {
        var key = keySelector(item);
        var value = valueSelector(item);
        if (dictionary.TryGetValue(key, out var existing) &&
            valueComparer.Compare(existing.Value, value) >= 0) continue;
        dictionary[key] = (item, value);
    }
    foreach (var entry in dictionary.Values)
        yield return entry.Item;
}

Usage example:

IEnumerable<Record> query = records
    .GroupMaxBy(x => x.GroupName, x => x.MemberValue);

The reverse GroupMinBy can be implemented similarly by replacing the >= with <=.
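As a sketch of that change, the following mirrors GroupMaxBy with only the comparison flipped. The GroupingExtensions class name and the tuple-based sample data are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample data: (GroupName, MemberValue) pairs taken from the question.
var records = new[]
{
    (GroupName: "Group1", MemberValue: 1),
    (GroupName: "Group1", MemberValue: 2),
    (GroupName: "Group2", MemberValue: 3),
    (GroupName: "Group2", MemberValue: 2),
    (GroupName: "Group3", MemberValue: 4),
    (GroupName: "Group3", MemberValue: 1),
};

// One element per GroupName: the one with the smallest MemberValue.
var minima = records.GroupMinBy(x => x.GroupName, x => x.MemberValue).ToList();

foreach (var entry in minima)
    Console.WriteLine(entry);

// GroupingExtensions is a hypothetical container class for the extension method.
public static class GroupingExtensions
{
    public static IEnumerable<TSource> GroupMinBy<TSource, TKey, TValue>(
        this IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector,
        Func<TSource, TValue> valueSelector,
        IEqualityComparer<TKey> keyComparer = default,
        IComparer<TValue> valueComparer = default)
    {
        // Arguments validation omitted
        valueComparer ??= Comparer<TValue>.Default;
        var dictionary = new Dictionary<TKey, (TSource Item, TValue Value)>(keyComparer);
        foreach (var item in source)
        {
            var key = keySelector(item);
            var value = valueSelector(item);
            // Skip the item when the stored value is already smaller or equal
            // (<= here, where GroupMaxBy uses >=).
            if (dictionary.TryGetValue(key, out var existing) &&
                valueComparer.Compare(existing.Value, value) <= 0) continue;
            dictionary[key] = (item, value);
        }
        foreach (var entry in dictionary.Values)
            yield return entry.Item;
    }
}
```

As with GroupMaxBy, the first element seen wins when two elements in a group have equal values.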

Below is a demonstration of the difference in memory-efficiency between the two approaches:

var source = Enumerable.Range(1, 1_000_000);
{
    var mem0 = GC.GetTotalAllocatedBytes(true);
    source.GroupBy(x => x % 1000).Select(g => g.MaxBy(x => x % 3333)).Count();
    var mem1 = GC.GetTotalAllocatedBytes(true);
    Console.WriteLine($"Allocated: {mem1 - mem0:#,0} bytes");
}
{
    var mem0 = GC.GetTotalAllocatedBytes(true);
    source.GroupMaxBy(x => x % 1000, x => x % 3333).Count();
    var mem1 = GC.GetTotalAllocatedBytes(true);
    Console.WriteLine($"Allocated: {mem1 - mem0:#,0} bytes");
}

Output:

Allocated: 8,571,168 bytes
Allocated: 104,144 bytes

Try it on Fiddle.

Theodor Zoulias