SQL "IN" statement in linq query mistake, how resolve?

Question

I have this query in SQL:

SELECT * 
FROM TableName
WHERE myData IN (SELECT MAX(myData) AS DATA_MAX  
                 FROM TableName 
                 GROUP BY id1, id2)

I want replicate it in Linq (c#) - how can I do that?

*why* do you want to convert it to LINQ? I never understand this obsession; there are great tools you can use to *very easily* execute that exact SQL from .NET code, which will be *much more efficient* than having to parse a LINQ expression tree, and will require zero effort to implement. Is there a reason you need it as LINQ? Are you interested in non-LINQ solutions? Also: what is the "mistake" alluded to in the question title? — Marc Gravell, Feb 08 '18 at 09:27
@MarcGravell is very right. Linq is a great thing but we dont need to shoehorn it into everything and its not always the most efficient way of doing it. — BugFinder, Feb 08 '18 at 09:33
This query is 100000% unsuitable for LINQ - it has nothing to do with Objects or Mapping. This is a *reporting* query. The query itself needs optimization - could you use `MAX() OVER(PARTITION BY id1,id2)` perhaps? Would the execution plan be better? What about extra indexes? Create a *View* and map to it, so you can optimize the query withour requiring recompilation and re-deployment — Panagiotis Kanavos, Feb 08 '18 at 09:53
*how can I do that* - Start by choosing an ORM that supports LINQ. — Gert Arnold, Feb 08 '18 at 09:53
Possible duplicate of [how to use SQL group to filter rows with maximum date value](https://stackoverflow.com/questions/24140048/how-to-use-sql-group-to-filter-rows-with-maximum-date-value) — Panagiotis Kanavos, Feb 08 '18 at 10:05
Chec [How to filter rows with max value](https://stackoverflow.com/questions/24140048/how-to-use-sql-group-to-filter-rows-with-maximum-date-value). You can improve the query a lot by using ROW_NUMBER(), eg `Select * from (select *,ROW_NUMBER() OVER (partition by id1,id2 order by mydata desc) as RN) T where RN=1`. You could even create a *view* using the inner query, allowing you to select the top N entries by `MyData` and map it to whatever `TableName` maps to — Panagiotis Kanavos, Feb 08 '18 at 10:07

score 2 · Answer 1 · answered Feb 08 '18 at 09:34

2

This isn't really a direct answer because it doesn't implement it via LINQ; but it does solve the problem, with the minimum amount of fuss:

You can use tools like "Dapper" to execute raw queries without involving any LINQ. If you're using something like LINQ-to-SQL or Entity Framework, the data-context there also usually has a raw query API that you can use, but I'm going to show a "Dapper" implementation:

class SomeType
{
    // not shown: properties that look like the columns
    // of [TableName] in the database - correct names/types
}
...
var data = connection.Query<SomeType>(@"
SELECT * FROM TableName
WHERE myData IN (Select max(myData) as DATA_MAX  from TableName group
by id1, id2)").AsList();

This approach makes it very easy to migrate existing SQL queries without having to rewrite everything as LINQ.

If you are using LINQ-to-SQL, DataContext has a similiar ExecuteQuery<TResult> method. Entity Framework has a SqlQuery method

answered Feb 08 '18 at 09:34

Marc Gravell

1,026,079
266
2,566
2,900

Better yet - create a view or function with that query. Reporting queries like this require quite a bit of optimization. Could it be rewritten with `MAX OVER(partition by id1,id2)` perhaps? Would that avoid the double table scans? One shouldn't have to redeploy the application to apply optimizations – Panagiotis Kanavos Feb 08 '18 at 09:55
@PanagiotisKanavos I agree with your thoughts on optimization; I disagree with your conclusion that a view is the right way to do that - it all depends on the context; in many scenarios, it may be more suitable to code the query into the application - especially if it is server-based and can be deployed with a single click - and can *reduce* deployment overhead while *increasing* testability. There are *also* scenarios in which a view is the **correct** way to deploy it to minimize deployment overhead. There is no "one solution" to this type of question. – Marc Gravell Feb 08 '18 at 10:14
It's not a *conclusion*. It's typically easier to test the query if it's in a view. For example, this query can be rewritten to add a ROW_NUMBER or RANK per row and return only those with RN=1. A hard-coded query string would require modifications and deployment of something unrelated to the query itself – Panagiotis Kanavos Feb 08 '18 at 10:19
@PanagiotisKanavos I genuinely have no cares about where the query lies; it would present me exactly zero challenge to develop and test it in either configuration. But by having it *not* a view, I can do things like: deploy to 1 server in the cluster and do a side-by-side live smoke test. It also means I can change the *shape* of the query - perhaps multiple result sets instead of a single "wide" result set; you can't do that with a view or a sproc, since it will break the consumer - but if the query is *part* of the consumer, it is trivial to iterate and deploy such refactored changes. – Marc Gravell Feb 08 '18 at 10:26
Different scenarios, different solutions. I could say the opposite. Right now I write a lot of ETL and reporting code. For ETL, views provide an abstraction layer that isolates the consumer from query changes or table refactorings. As long as the fields are the same, consumers don't know the difference. Which makes it *easier* to refactor, smoke test, compare execution plans between options. I don't have to redeploy an app/ETL script to fix a query. – Panagiotis Kanavos Feb 08 '18 at 10:36
Reporting on the other hand is 50-50. Sometimes, a common view is used in multiple reports. Sometimes, the query is specific to the report so adding a view means I have to deploy two things. *Except* when the query gets big enough that I need the view to simplify things. – Panagiotis Kanavos Feb 08 '18 at 10:38

jignesh patel · Answer 2 · 2018-02-08T09:43:25.733

0

You can try this. May be it will work.

var myData = (from c in _context.TableName
    group c by new
    {
        c.id1,
        c.id2
    } into gcs
    select new
    {
       gcs.Max(p=>p.myData)
    }).AsQueryable();

var result = (from t in _context.TableName
              where myData.Contains(t.myData)
              select t).ToList();

edited Feb 08 '18 at 09:43

answered Feb 08 '18 at 09:36

jignesh patel

964
7
13

I think the `ToList()` on the first query is a mistake, personally - it forces this into two queries rather than a single composed query, and for all we know the data volume of the first query could be significant. You *might* be able to simply remove the `ToList()` and have it all work via `IQueryable`. Also: you don't need to add "check my answer" (now deleted, not by me) every time you add an answer: the OP automatically gets a notification of an answer. – Marc Gravell Feb 08 '18 at 09:42
@MarcGravell IQueryable is nice option – jignesh patel Feb 08 '18 at 09:44
you don't need to *add* `AsQueryable()`; if the `_context.Table` *is* a query (`IQueryable`), then the result is already a query (for a different `T`). If it isn't aready a query: adding `AsQueryable()` won't make it one. – Marc Gravell Feb 08 '18 at 09:52
@jigneshpatel it's not - the query is *already* a Queryable. Whatever SQL query is generated though, it won't look like what was asked. It will probably be quite a bit worse – Panagiotis Kanavos Feb 08 '18 at 09:53
@PanagiotisKanavos (shares a knowing nod) - oh, the horrors that I've seen from LINQ queries and SQL generation :) Don't get me wrong: if you aren't very good at SQL, LINQ will probably be better at SQL than you - but for anyone with good SQL skills... the least said the better :) – Marc Gravell Feb 08 '18 at 09:55

score 0 · Accepted Answer · answered Feb 08 '18 at 10:30

Long story short - don't use LINQ, optimize the query and use a microORM like Dapper to map results to classes :

var query = "Select * "
            "from ( select *, " +
            "       ROW_NUMBER() OVER (partition by id1,id2 order by mydata desc) AS RN " +
            "       From TableName ) T " +
            "where RN=1";
var data = connection.Query<SomeType>(query);

LINQ isn't a replacement for SQL. ORMs in general aren't meant to write reporting queries like this one.

Reporting queries need a lot of optimization and usually have to change in production. You don't want to have to redeploy your application each time a query changes. In this case it's far better to create a view and map to it using a microOMR like Dapper.

This specific query could require two table scans, one to calculate the maximum per id1,id2 and one to find the rows with matching mydata. The intermediate data would have to be spooled into tempdb too. If mydata is covered by an index, it may not be such an expensive query. If it isn't, all the data will be scanned twice.

An alternative is to calculate the ranking of each row by mydata based on id1, id2. You can do this with one of the ranking functions like ROW_NUMBER, RANK, NTILE.

Select * 
from ( select *,
              ROW_NUMBER() OVER (partition by id1,id2 order by mydata desc) AS RN
       From TableName) T 
where RN=1

You can use that query directly with Dapper or create a view and map your entities to the view, not the table itself.

One option would be to crate a MyTableRanked view :

CREATE VIEW MyTableRanked AS
select *,
       ROW_NUMBER() OVER (partition by id1,id2 order by mydata desc) AS RN
From TableName

This would allow you to write :

var query="Select * from MyTableRanked where RN=@rank";
var data = connection.Query<SomeType>(query,new {rank=2});

Allowing you to return the top N records per ID1,ID2 combination

SQL "IN" statement in linq query mistake, how resolve?

3 Answers3