Using DISTINCT on a subquery to remove duplicates in Entity Framework

Question

I have question about use of Distinct with Entity Framework, using Sql 2005. In this example:

practitioners = from p in context.Practitioners
                join pn in context.ProviderNetworks on
                     p.ProviderId equals pn.ProviderId
                (notNetworkIds.Contains(pn.Network))
                select p;

practitioners = practitioners
                  .Distinct()
                  .OrderByDescending(p => p.UpdateDate); 

data = practitioners.Skip(PageSize * (pageOffset ?? 0)).Take(PageSize).ToList();

It all works fine, but the use of distinct is very inefficient. Larger result sets incur unacceptable performance. The DISTINCT is killing me. The distinct is only needed because multiple networks can be queried, causing Providers records to be duplicated. In effect I need to ask the DB "only return providers ONCE even if they're in multiple networks". If I could place the DISTINCT on the ProviderNetworks, the query runs much faster.

How can I cause EF to add the DISTINCT only the subquery, not to the entire resultset?

The resulting simplified sql I DON'T want is:

select DISTINCT p.* from Providers 
inner join Networks pn on p.ProviderId = pn.ProviderId
where NetworkName in ('abc','def')

IDEAL sql is:

select p.* from Providers 
inner join (select DISTINCT ProviderId from Networks 
            where NetworkName in ('abc','def')) 
as pn on p.ProviderId = pn.ProviderId

Thanks Dave

No, I think the ideal SQL should be `select * from Providers where ProviderId in (select ProviderId from Networks where NetworkName in ('abc','def'))` — the answer you received generates this SQL or something equivalent to it. Note that whether the subquery here has a `DISTINCT` on it or not makes no difference (at least not in MSSQL), they execute the same. — Timwi, May 18 '11 at 19:33
Thanks for the comment. It does matter, however. When asking for more than one network, the provider results duplicate, due to the one-to-many relationship. Further, doing a distinct on just ONE field (provider Id) is more efficient than a distinct on all the rows of the provider table. — DHickey, May 24 '11 at 21:55
What @Timwi means is that if the sub-query is in a `IN` or `EXISTS` statement having `DISTINCT` does not matter as duplicates are already ignored. — Magnus, May 24 '11 at 22:15

Magnus · Accepted Answer · 2011-05-18T19:40:10.330

6

I dont think you need a Distinct here but a Exists (or Any as it is called in Linq)

Try this:

    var q = (from p in context.Practitioners
            where context.ProviderNetworks.Any(pn => pn.ProviderId == p.ProviderId && notNetworkIds.Contains(pn.Network))
            orderby p.UpdateDate descending
            select p).Skip(PageSize * (pageOffset ?? 0)).Take(PageSize).ToList();

edited May 18 '11 at 19:40

answered May 18 '11 at 18:49

Magnus

45,362
8
80
118

Using DISTINCT on a subquery to remove duplicates in Entity Framework

1 Answers1

Linked