9

I have a huge list of Items and need to Group them by one property. Then the oldest of each group should be selected.

Simplified Example: Select the oldest User of each FirstName.

using (ED.NWEntities ctx = new ED.NWEntities())
{
    IQueryable<ED.User> Result = ctx.User.GroupBy(x => x.FirstName)
                                    .Select(y => y.OrderBy(z => z.BirthDate)
                                    .FirstOrDefault())
                                    .AsQueryable();
}

Class User:

public partial class User
{
    public int UserID { get; set; }
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public Nullable<System.DateTime> BirthDate { get; set; }
}

I was wondering why this statement took so long until I set a breakpoint at Result and looked into the SQL statement generated:

{SELECT
`Apply1`.`UserID`, 
`Apply1`.`FIRSTNAME1` AS `FirstName`, 
`Apply1`.`LastName`, 
`Apply1`.`BirthDate`
FROM (SELECT
`Distinct1`.`FirstName`, 
(SELECT
`Project2`.`UserID`
FROM `User` AS `Project2`
 WHERE (`Distinct1`.`FirstName` = `Project2`.`FirstName`) OR ((`Distinct1`.`FirstName` IS  NULL) AND (`Project2`.`FirstName` IS  NULL))
 ORDER BY 
`Project2`.`BirthDate` ASC LIMIT 1) AS `UserID`, 
(SELECT
`Project2`.`FirstName`
FROM `User` AS `Project2`
 WHERE (`Distinct1`.`FirstName` = `Project2`.`FirstName`) OR ((`Distinct1`.`FirstName` IS  NULL) AND (`Project2`.`FirstName` IS  NULL))
 ORDER BY 
`Project2`.`BirthDate` ASC LIMIT 1) AS `FIRSTNAME1`, 
(SELECT
`Project2`.`LastName`
FROM `User` AS `Project2`
 WHERE (`Distinct1`.`FirstName` = `Project2`.`FirstName`) OR ((`Distinct1`.`FirstName` IS  NULL) AND (`Project2`.`FirstName` IS  NULL))
 ORDER BY 
`Project2`.`BirthDate` ASC LIMIT 1) AS `LastName`, 
(SELECT
`Project2`.`BirthDate`
FROM `User` AS `Project2`
 WHERE (`Distinct1`.`FirstName` = `Project2`.`FirstName`) OR ((`Distinct1`.`FirstName` IS  NULL) AND (`Project2`.`FirstName` IS  NULL))
 ORDER BY 
`Project2`.`BirthDate` ASC LIMIT 1) AS `BirthDate`
FROM (SELECT DISTINCT 
`Extent1`.`FirstName`
FROM `User` AS `Extent1`) AS `Distinct1`) AS `Apply1`}

Question: Is there a way to solve his more efficient? Sub-selects are expensive and EF generates one per column. I use mySQL .NET Connector version 6.9.5.0

fubo
  • 44,811
  • 17
  • 103
  • 137

6 Answers6

3

Using Jon Skeet's answer on distinct..

public static IEnumerable<TSource> DistinctBy<TSource, TKey>
    (this IEnumerable<TSource> source, Func<TSource, TKey> keySelector)
{
    HashSet<TKey> seenKeys = new HashSet<TKey>();
    foreach (TSource element in source)
    {
        if (seenKeys.Add(keySelector(element)))
        {
            yield return element;
        }
    }
}

You can try:

using (ED.NWEntities ctx = new ED.NWEntities())
{
    IQueryable<ED.User> Result = ctx.User.OrderBy(y => y.BirthDate)
                                    .DistinctBy(z => z.FirstName)
                                    .AsQueryable();
}
Community
  • 1
  • 1
jegtugado
  • 5,081
  • 1
  • 12
  • 35
  • 3
    but this one fetches the whole table from the database and treats the data afterwards? – fubo Apr 22 '16 at 07:11
  • Fetching flat data first and then doing the grouping in memory might be a valid approach to overcome MySqls's tendency to choke in sub queries. – Gert Arnold Apr 22 '16 at 08:15
2

You may try to do something more approaching the way you would do it in sql (without a "row_number like" function)... and see what's generated.

var maxAges = ctx.User.GroupBy(x => x.FirstName)
                      .Select(g => new {
                         firstName = g.Key,
                         maxAge = g.Min(x => x.BirthDate)
                      });
var result = from u in ctx.User
             join a in maxAges on new{f = u.FirstName, b =u.BirthDate} equals new{f = a.firstName, b =a.maxAge}
             select u;

(mixing fluent and query syntax, as I find query syntax clearer for joins, but... that's just a personal point of view)

Raphaël Althaus
  • 59,727
  • 6
  • 96
  • 122
  • Is there a way to get the ID into joining? That would avoid invalid values in case `BirthDate` + `FirstName` are duplicate – fubo Apr 22 '16 at 07:57
  • Yes you can include *ID* as well, we could do a groupby by *FirstName + Birthdate + ID* – bit Apr 22 '16 at 08:03
  • @fubo You would just take a random one in this case ? You may do your first query in the enumerated result (as you should not have too many results back from db), but that's not really nice (but... that's optimization) – Raphaël Althaus Apr 22 '16 at 08:04
  • @bit That would not return the same result as the OP's query – Raphaël Althaus Apr 22 '16 at 08:05
1

You are grouping them first and then ordering each subquery. Of course, it would be slow.

Try to order the table first, so you only have to do this once. And then group them and take the first.

IQueryable<ED.User> Result = ctx.User
    .OrderBy(x => x.BirthDate)
    .GroupBy(x => x.FirstName, (k,g) => g.FirstOrDefault())
    .AsQueryable();
fubo
  • 44,811
  • 17
  • 103
  • 137
Xiaoy312
  • 14,292
  • 1
  • 32
  • 44
1

I'm quite sure that when you are using mySQL you are able to create a GROUP BY clause which differs to your SELECT statement. With other words the rows that you are selecting must not be part of the aggregation function. So a query like this should work:

SELECT
      FirstName
      ,LastName
      ,BirthDate
  FROM Users
  GROUP BY FirstName
  ORDER BY BirthDate

Please try this in your mySQL Query Browser. Than you can use this query directly with you entity framework context like this:

string query = ".."; // the query above

var res = context.Database.SqlQuery<Users>(query).ToList();
t2t
  • 1,101
  • 2
  • 12
  • 15
0

Looking at this, your previous and some other questions (like this), looks like using EF with MySQL is a pain.

You can eventually try this LINQ query

var query = db.User.Where(user => !db.User.Any(
    u => u.UserID != user.UserID && u.FirstName == user.FirstName &&
    (u.BirthDate < user.BirthDate || (u.BirthDate == user.BirthDate && u.UserID < user.UserID))));

which generates this simple SQL query

SELECT
`Extent1`.`UserID`, 
`Extent1`.`FirstName`, 
`Extent1`.`LastName`, 
`Extent1`.`BirthDate`
FROM `Users` AS `Extent1`
 WHERE NOT EXISTS(SELECT
1 AS `C1`
FROM `Users` AS `Extent2`
 WHERE ((`Extent2`.`UserID` != `Extent1`.`UserID`) AND (`Extent2`.`FirstName` = `Extent1`.`FirstName`)) AND ((`Extent2`.`BirthDate` < `Extent1`.`BirthDate`) OR ((`Extent2`.`BirthDate` = `Extent1`.`BirthDate`) AND (`Extent2`.`UserID` < `Extent1`.`UserID`))))

although I'm not sure what will be the performance impact.

Community
  • 1
  • 1
Ivan Stoev
  • 195,425
  • 15
  • 312
  • 343
  • this is as slow as the query from my quesiton. probably i need to add some indexes – fubo Apr 25 '16 at 13:04
  • Fair enough. I had that feeling. Making `FirstName` required and creating index on it might help. Other than that, I don't see any viable query construct, probably you should consider in memory approaches. – Ivan Stoev Apr 25 '16 at 14:23
0

You'll need indexes and this doesn't guarantee the best performance because the EF generated query will most likely be a large nested subqueries.

if performance is still a problem, you can return the user ID of the oldest for each group and run another query to get the User object.

Worse case, use inline sql, a view, or stored proc.

since i don't use Mysql and I don't what indexes you have, I'll leave this task for you.

  var oldestUsers = (from u in users
                       group u by u.FirstName into grp 
                       select new {
                           grp.Key,
                           oldestUser = (from u in grp
                                         orderby u.BirthDate descending
                                         select u).First()
                       }).ToList();

    foreach (var u in oldestUsers)
    {
        Console.WriteLine("{0} {1:D}", u.oldestUser.FirstName, u.oldestUser.BirthDate);
    }
dfdsfdsfsdf
  • 653
  • 3
  • 7