13

I am using Linq to entities to query the database to get the list of int for further processing. I have two ways to get the list as below:

First is:

List<int> lstBizIds = new List<int>() { 1, 2, 3, 4, 5 };
List<int> lstProjectIds = context.Projects.Where(x => lstBizIds.Contains(x.businessId)).Select(x => x.projectId).ToList();

Second is:

List<int> lstBizIds = new List<int>() { 1, 2, 3, 4, 5 };
List<int> lstProjectIds = context.Projects.Join(lstBizIds, p => p.businessId, u => u, (p, u) => p.projectId).ToList();

Now my question is which one of the methods above is better performance wise? Also does it affect the performance if the first list i.e. lstBizIds grows in size? Suggest me other ways of implementation as well if that are performance reducing.

Girish Vadhel
  • 735
  • 1
  • 5
  • 17
  • 1
    Are you sure your first option works? Does the query execute and give you a result? – Sefe Nov 17 '16 at 09:26
  • yes, definitely it will work. – Girish Vadhel Nov 17 '16 at 09:29
  • 1
    Sorry, I meant the second option. Anyway, I assume you successfully tried both. – Sefe Nov 17 '16 at 09:30
  • Not performance related, but one other note when deciding between the two is that, in EF Core anyway, different connectors/providers escape characters differently when doing a `Where` `IN` vs a `JOIN`. This [just bit me](https://stackoverflow.com/questions/51930802/get-all-entities-from-dbsettentity-with-a-property-value-that-is-in-an-ienumer?noredirect=1&lq=1) with Pomelo MySQL for EF Core, but could be present in any provider I would think. – Collin Barrett Aug 20 '18 at 20:50

4 Answers4

6

You should go with Contains, because EF can produce a more efficient query.

This would be the SQL join:

SELECT Id
FROM Projects
INNER JOIN (VALUES (1), (2), (3), (4), (5)) AS Data(Item) ON Projects.UserId = Data.Item

This would be the SQL Contains:

SELECT Id
FROM Projects
WHERE UserId IN (1, 2, 3, 4, 5, 6)

IN is more efficient than JOIN because the DBMS can stop looking after the first match of the IN; the JOIN always finishes, even after the the first match.

You might also want to check which queries are actually sent to the DB. You always have to compare the SQL, not the LINQ code (obviously).

Sefe
  • 13,731
  • 5
  • 42
  • 55
  • 6
    You are right for the generated SQL (although the `join` for SqlServer looks differently) +1. But not for which one if more efficient - that really depends on database query plan optimizer. Most modern query optimizers would produce one and the same execution plan. – Ivan Stoev Nov 17 '16 at 09:42
  • 9
    Note that Contains is VERY slow in Entity Framework for relatively large sets (say > 1000 items). It is slow for internal reasons, not because the generated query is slow. So because of that - better never use Contains on large sets. – Evk Nov 17 '16 at 09:49
  • On the contrary, I recently run a test comparing Contains and Join, Contains is better than Join in every way. Contains is faster than Join in terms of query execution. And also faster in terms of SQL generation. Performing Join with 100K items will give you a StackOverflow exception during SQL generation. – Xiaoguo Ge Oct 12 '17 at 22:25
  • 1
    I decided to use the *join* syntax as it looks neater in the code but I've just found out the hard way that was a bad idea! The resultant SQL is huge (it creates a union for every element) and Entity Framework completely fails at runtime when the collection it is comparing is very large. Took me a long time to track down the cause of a live website failing. Replaced the Joins with Contains and all ok now – userSteve Jan 24 '19 at 11:18
2

Performing a join is quite efficient because Where condition actually performs a cartesian product of all the tables, then filters the rows that satisfy the condition. This means the Where condition is evaluated for each combination of rows (n1 * n2 * n3 * n4)

The Join operator takes the rows from the first tables, then takes only the rows with a matching key from the second table, then only the rows with a matching key from the third table, and so on. Secondly, contains would work in an iterative manner making it slower than join

g.005
  • 396
  • 1
  • 12
  • Yes, I also found that in one of the link on stack overflow: http://stackoverflow.com/questions/5551264/why-is-linq-join-so-much-faster-than-linking-with-where But also found contradictory explanation here: http://stackoverflow.com/questions/32650224/innerjoin-vs-contains-which-is-faster – Girish Vadhel Nov 17 '16 at 09:28
  • 2
    This applies to LINQ to Objects, but the question is about LINQ to Entities. – Ivan Stoev Nov 17 '16 at 09:38
  • The Join method internally uses indexes to join two result set but where uses a combination mechanism that makes it slower. The Join is much faster, because it knows how to combine to reduce the result to the relevant combinations. When you use Where to specify the relation, it has to create every possible combination which would end up making this slower. Hope this answers your question – g.005 Nov 17 '16 at 09:41
  • You probably are missing the part that one of the involved parties is a db table and the other - memory list. – Ivan Stoev Nov 17 '16 at 09:46
1

I just spent quite some time trying to find, what caused a stack overflow error in a program with a few simple LINQ queries accessing a mid size database.

for ICollection with ~10k elements on one side, and sql table on the other, a single change from "join" to "Contains" fixed the stack overflow error.

it seems that notwithstanding comparative performance, Contains is a safer choice.

I.Cekusins
  • 11
  • 1
  • 1
0


I choose the first,because it does not increase the computer's memory.
if you use both array to compare the conditions, choose from the second.