3

I have a class Event that has two properties: "ID" and "ExpirationTime". I have a list containing many events, some of which share the same ID. I want to write an efficient LINQ query that makes the events distinct by ID and, for each ID, keeps the event with the smallest ExpirationTime.

Thanks!

Guillaume86
Adibe7

7 Answers

4

The grouping is easy enough, but doing an efficient "MinBy" with standard LINQ to Objects is slightly messy:

var lowestByID = items.GroupBy(x => x.ID)
                      .Select(group => group.Aggregate((best, next) =>
                                   best.ExpirationTime < next.ExpirationTime 
                                  ? best : next));

It's cleaner with a MinBy operator, such as the one provided with MoreLinq.

var lowestByID = items.GroupBy(x => x.ID)
                      .Select(group => group.MinBy(x => x.ExpirationTime));
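
For what it's worth, on .NET 6 or later a MinBy operator is part of LINQ itself (Enumerable.MinBy), so the second form should work without an extra library. A minimal sketch, assuming a hypothetical Event class with just the two properties from the question; the class layout and the LowestById helper name are illustrative, not something from the original post:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical Event type with the two properties described in the question.
public sealed class Event
{
    public Event(int id, DateTime expirationTime)
    {
        ID = id;
        ExpirationTime = expirationTime;
    }

    public int ID { get; }
    public DateTime ExpirationTime { get; }
}

public static class Program
{
    // For each ID, keep the event with the earliest ExpirationTime.
    // Relies on Enumerable.MinBy, which requires .NET 6 or later.
    public static List<Event> LowestById(IEnumerable<Event> items) =>
        items.GroupBy(x => x.ID)
             .Select(group => group.MinBy(x => x.ExpirationTime))
             .ToList();

    public static void Main()
    {
        var start = new DateTime(2011, 2, 9);
        var items = new List<Event>
        {
            new Event(1, start),
            new Event(1, start.AddDays(1)),
            new Event(2, start.AddDays(2)),
            new Event(2, start.AddDays(-22)),
        };

        foreach (var e in LowestById(items))
        {
            Console.WriteLine($"{e.ID}: {e.ExpirationTime}");
        }
    }
}
```

Each group produced by GroupBy is non-empty, so MinBy always has an element to return here.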
Ani
  • @LukeH: I might be wrong, but I think I picked up the aggregate trick for `O(n)` MinBy from one of *your* answers. :) – Ani Feb 09 '11 at 15:55
  • @Ani: Maybe. I was just about to post the same answer here, but you beat me to it ;) – LukeH Feb 09 '11 at 15:56
  • Using Aggregate to implement Max, very clever! – Ohad Schneider Feb 09 '11 at 15:57
  • @ohadsc: It is, but I can't claim credit for it; I picked up the idea from elsewhere on SO. :) – Ani Feb 09 '11 at 15:58
  • @Ani That's OK, I am already under the assumption that every answer in SO originates in one of Jon Skeet's post :) – Ohad Schneider Feb 09 '11 at 16:00
  • @David B: It's the most efficient in the sense that it's O(N); however, I don't think it's actually the *most* efficient way. See my answer for a more efficient (but less "clean"-looking) approach. – Dan Tao Feb 09 '11 at 17:38
3

LINQ's Distinct() on a particular property

Simple! You want to group them and pick a winner out of the group.

List<Event> distinctEvents = allEvents
   .GroupBy(e => e.Id)
   .Select(g => g.OrderBy(e => e.ExpirationTime).First())
   .ToList(); 
Amy B
  • Nice! However, note that sorting is O(n log n) whereas max is O(n) – Ohad Schneider Feb 09 '11 at 15:53
  • @ohadsc You are correct. I'm deliberately trading away a little performance for ease of use/read. Also - one would expect each group to be quite a bit smaller than the total list, so these mini- orderings are faster than ordering the whole list. – Amy B Feb 09 '11 at 15:56
3

I believe this should outperform the GroupBy suggestion (see brief explanation below):

IEnumerable<Event> DistinctEvents(IEnumerable<Event> events)
{
    var dict = new Dictionary<int, Event>();

    foreach (Event e in events)
    {
        Event existing;
        if (!dict.TryGetValue(e.Id, out existing) || e.ExpirationTime < existing.ExpirationTime)
        {
            dict[e.Id] = e;
        }
    }

    foreach (Event e in dict.Values)
    {
        yield return e;
    }
}

Explanation: While this and the GroupBy method proposed by Ani have the same algorithmic complexity (as far as I can tell, anyway), the above approach is more efficient in practice for two reasons.

  1. GroupBy internally uses a Lookup<TKey, TElement> (very similar to a Dictionary<TKey, List<TElement>>) which actually populates internal collections with the contents of the input sequence. This requires more memory and also has a performance impact, particularly because, while the sub-collections have amortized O(1) insertion time, they will occasionally need to resize themselves, which is O(N) (where N is the size of the sub-collection). This is not a big deal, but it's still a lot more work than you really need to be doing.
  2. A consequence of point #1 is that this in turn requires iterating over each element in the input sequence before GroupBy can provide an enumerator (so it's deferred execution, but then the entire input sequence needs to be iterated before iterating over the result of GroupBy). Then you're iterating over each group again in the call to Aggregate; so in all, you're iterating over the elements in the input sequence twice, which is more times than necessary to accomplish the task at hand.
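
Point #2 can be observed directly. Here is a minimal sketch, assuming a hypothetical CountingSource iterator (not part of the original code) that counts how many elements have been pulled from it. Nothing is read when GroupBy is called, but asking for just the first group drains the whole source:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class Program
{
    // Number of elements pulled from the source so far.
    public static int Pulled;

    // Hypothetical source that records every element requested from it.
    public static IEnumerable<int> CountingSource()
    {
        foreach (var id in new[] { 1, 2, 1, 2 })
        {
            Pulled++;
            yield return id;
        }
    }

    // Returns the pull count right after calling GroupBy and right after
    // fetching the first group.
    public static (int Before, int After) Measure()
    {
        Pulled = 0;
        var groups = CountingSource().GroupBy(id => id);
        int before = Pulled;        // deferred: nothing has been read yet

        var firstGroup = groups.First();
        int after = Pulled;         // the entire source was read to build the lookup

        return (before, after);
    }

    public static void Main()
    {
        var (before, after) = Measure();
        Console.WriteLine(before);  // 0
        Console.WriteLine(after);   // 4
    }
}
```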

As I said, the algorithmic complexity is the same, which means the two approaches should be equally scalable; this one is simply faster. I took the liberty of testing both approaches (out of curiosity, mostly) and found the above to execute in roughly half the time and cause fewer GC collections (a rough approximation of memory usage) than the GroupBy approach.

These are minute concerns, which it would normally be a waste of time to think too much about. The only reason I mention them is that you asked for an efficient solution (and even bolded the term); so I figured you would want to take these kinds of factors into consideration.

Dan Tao
  • +1 Nice, that's a lot of effort; benchmarking and all. (This is one of the issues with the 'information pipeline' in LINQ to Objects, the operators don't have big-picture knowledge, so the entire query cannot be optimized on that basis) – Ani Feb 09 '11 at 17:42
  • @Ani: Yeah, and to be fair I see that the OP *did* specifically ask for a "LINQ query"; my answer doesn't really fit that description. I always find it a *little* odd, though, when developers seek to find the most "efficient" solution to a problem and add the requirement that it must use LINQ (kind of like "I want the best tool for this job, and that tool must be a hammer"). As for the benchmarking, it's something I do so often I just have a little sandbox project with all the benchmarking tools included; essentially I pop in delegates and see how they perform over a bunch of iterations. – Dan Tao Feb 09 '11 at 17:51
  • @Ani: ...which isn't to say that I *don't* spend too much time on SO (I clearly do)! – Dan Tao Feb 09 '11 at 17:52
  • @Dan: And there's no reason that you couldn't make this method into a generic `PartitionedMinBy` extension method that could be usable in a LINQ query. Accept `partitionKeySelector` and `compareKeySelector` delegates as arguments and away you go... – LukeH Feb 09 '11 at 18:02
  • @LukeH: You're totally right; I guess I just felt that this was such specialized behavior that a generic version might be more trouble than it's worth (the ol' YAGNI principle). For instance I would think an *ideal* generic version, in addition to two selector functions, would also accept an optional `IEqualityComparer` for the key selector and an `IComparer` for the value selector. And I doubt it would get used much. But you're right that it's definitely doable. – Dan Tao Feb 09 '11 at 18:07
  • @Dan: Agreed. If this were my problem then I'd probably go with Ani's suggestion of `GroupBy`/`MinBy` in most situations: a nice trade-off between performance and readability. And if I then found that I needed more performance I'd go for something similar to your method above, as it is. – LukeH Feb 09 '11 at 18:12
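
The generic version discussed in these comments might look roughly like the sketch below. The name PartitionedMinBy and the two selector parameter names come from LukeH's comment; the optional comparer parameters follow Dan Tao's description. Everything else (type parameter names, tie handling) is an assumption for illustration, and this is not a built-in LINQ operator:

```csharp
using System;
using System.Collections.Generic;

public static class PartitionExtensions
{
    // Hypothetical extension: one pass over the source, keeping the element
    // with the smallest compare key for each partition key.
    // Note: Dictionary does not accept null keys, so partition keys must be non-null.
    public static IEnumerable<TSource> PartitionedMinBy<TSource, TPartitionKey, TCompareKey>(
        this IEnumerable<TSource> source,
        Func<TSource, TPartitionKey> partitionKeySelector,
        Func<TSource, TCompareKey> compareKeySelector,
        IEqualityComparer<TPartitionKey> partitionKeyComparer = null,
        IComparer<TCompareKey> compareKeyComparer = null)
    {
        var comparer = compareKeyComparer ?? Comparer<TCompareKey>.Default;

        // A null equality comparer makes Dictionary fall back to the default one.
        var best = new Dictionary<TPartitionKey, TSource>(partitionKeyComparer);

        foreach (var item in source)
        {
            var key = partitionKeySelector(item);

            // Strict "less than": on ties, the first element seen is kept.
            if (!best.TryGetValue(key, out var existing) ||
                comparer.Compare(compareKeySelector(item), compareKeySelector(existing)) < 0)
            {
                best[key] = item;
            }
        }

        foreach (var item in best.Values)
        {
            yield return item;
        }
    }
}

public static class Program
{
    public static void Main()
    {
        // Tuples stand in for the Event class: (ID, ExpirationTime).
        var start = new DateTime(2011, 2, 9);
        var events = new[]
        {
            (ID: 1, ExpirationTime: start),
            (ID: 1, ExpirationTime: start.AddDays(1)),
            (ID: 2, ExpirationTime: start.AddDays(2)),
            (ID: 2, ExpirationTime: start.AddDays(-22)),
        };

        foreach (var e in events.PartitionedMinBy(e => e.ID, e => e.ExpirationTime))
        {
            Console.WriteLine($"{e.ID}: {e.ExpirationTime}");
        }
    }
}
```

Because it is written as an iterator, the method stays deferred like other LINQ operators. The order of the results follows Dictionary.Values, which is not documented as guaranteed, so add an OrderBy if the order matters.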
2

Assuming you can implement IComparable on your Event class (since LINQ's Min doesn't have an overload returning the original item otherwise), you can do:

var distinct = events.GroupBy(evt => evt.Id).Select(grp => grp.Min());

Example:

void Main()
{
    var events = new List<Event>
    {
        new Event(1, DateTime.Now),
        new Event(1, DateTime.Now.AddDays(1)),
        new Event(2, DateTime.Now.AddDays(2)),
        new Event(2, DateTime.Now.AddDays(-22)),
    };

    var distinct = events.GroupBy(evt => evt.Id).Select(grp => grp.Min());
}

public class Event : IComparable<Event>
{
    public Event(int id, DateTime exp)
    {
        Id = id;
        Expiration = exp;
    }
    public int Id {get; set;}
    public DateTime Expiration {get; set;}

    public int CompareTo(Event other)
    {
        return Expiration.CompareTo(other.Expiration);
    }
}
Ohad Schneider
1

I think this should do it:

events.GroupBy(x => x.ID, (key, items) => items.First(y => y.ExpirationTime == items.Min(z => z.ExpirationTime)))

Will group by ID, selecting as the result the event in items (where items represents all the events with the same ID) with the smallest ExpirationTime.
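
One thing to be aware of: the predicate above calls items.Min(...) again for every element that First examines, so a large group can cost quadratic time in the worst case. A possible variation, sketched under the assumption of a hypothetical Event class with the question's two properties (the EarliestPerId helper name is also illustrative), computes the minimum once per group:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical Event type with the two properties described in the question.
public sealed class Event
{
    public Event(int id, DateTime expirationTime)
    {
        ID = id;
        ExpirationTime = expirationTime;
    }

    public int ID { get; }
    public DateTime ExpirationTime { get; }
}

public static class Program
{
    public static List<Event> EarliestPerId(IEnumerable<Event> events) =>
        events.GroupBy(x => x.ID, (key, items) =>
        {
            // Evaluate the minimum once per group, not once per candidate element.
            var earliest = items.Min(z => z.ExpirationTime);
            return items.First(y => y.ExpirationTime == earliest);
        }).ToList();

    public static void Main()
    {
        var start = new DateTime(2011, 2, 9);
        var events = new List<Event>
        {
            new Event(1, start.AddDays(1)),
            new Event(1, start),
            new Event(2, start.AddDays(2)),
        };

        foreach (var e in EarliestPerId(events))
        {
            Console.WriteLine($"{e.ID}: {e.ExpirationTime}");
        }
    }
}
```

This still walks each group twice (once for Min, once for First), so the Aggregate or MinBy approaches remain the single-pass options.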

Kirk Woll
  • It will not be distinct, because 1) Where produces an IEnumerable, so you have to flatten with SelectMany, and 2) Where can include several events that have the same ExpirationTime – Andrey Feb 09 '11 at 15:51
1
events.GroupBy(e => e.ID).Select(g => new { ID = g.Key, Time = g.Min(e => e.ExpirationTime) });
Itay Karo
0
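        List<Event> events = null; // placeholder: assign the list of events here
        var distinct = events
            .GroupBy( e => e.ID )
            .Select( g =>
                g.First( e => 
                    // Min, not Max: the question asks for the smallest ExpirationTime
                    e.ExpirationTime == g.Min( t =>
                        t.ExpirationTime
                    )
                )
            );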
Andrey