Remove duplicate entities based on multiple properties using SQL/LINQ

Question

I have two entities, Etf and DataPoint. Each Etf has multiple DataPoints. See structure below

public class Etf
{
    public Etf() 
    { 
        DataPoints = new HashSet<DataPoint>(); 
    }

    public string Id { get; set; }
    public ICollection<DataPoint> DataPoints { get; private set; }
}

public class DataPoint { 
    public string EtfId { get; set; }
    public Etf Etf { get; set; }
    public DateTime Date { get; set; }

}

It happens that multiple datapoints with the same EtfId and Date are inserted into the database. I would like to remove the duplicates based on these two fields.

In SQL, I have tried this:

WITH CTE AS 
    (SELECT Id, ROW_NUMBER() OVER 
       (PARTITION BY EtfId, Date ORDER BY EtfId, Date DESC) 
       AS ROW_NUMBER FROM DataPoints) 
DELETE FROM CTE WHERE ROW_NUMBER > 1;

Which gives me an error that CTE is not an updatable table (perhaps since it is connected to both Etf as well as DataPoint). I then tried, based on this answer, the following in LINQ:

// Get the duplicate indices
var idxs = await _context.DataPoints
                    .GroupBy(g => new {g.EtfId, g.Date})
                    .OrderBy(g => g.Key.EtfId)
                    .ThenByDescending(g => g.Key.Date)
                    .SelectMany(g =>
                        g.Select((i, idx) => new { i.Id, Idx = idx }))
                    .Where(g => g.Idx > 0)
                    .Select(g => g.Id)
                    .ToListAsync(cancellationToken);

// Get the duplicate entities from indices
var duplicates = await _context.DataPoints
                    .Where(x => idxs.Contains(x.Id))
                    .ToListAsync(cancellationToken);

// Remove them
_context.DataPoints.RemoveRange(duplicates);

However, this approach gives me a System.InvalidOperationException at the Select((i, idx) => ..) statement saying it may be a bug or limitation from EF Core.

Is there any better (or just working) method that I can use?

score 0 · Answer 1 · answered Jun 18 '20 at 16:57

I managed to find a - hideous - solution with SQL:

await _context.Database
                .ExecuteSqlRawAsync("DELETE FROM DataPoints WHERE Id IN " +
                                    "(SELECT Id FROM " +
                                    "(WITH c AS (SELECT Id, ROW_NUMBER() " +
                                    "OVER (PARTITION BY EtfId, Date ORDER BY EtfId, Date DESC) AS rn FROM DataPoints)" +
                                    "SELECT * FROM c WHERE rn > 1) AS t);");

Please do not tell anyone about this post.

MarcG · Answer 2 · 2020-06-19T17:09:19.893

... and this is possibly a worse solution in c#/linq. It will remove all except one of the duplicate datapoints, datapoints with Id = 2 and 5.

void Main()
{
    var datapoints = new List<Datapoint>();
    datapoints.Add(new Datapoint() { Id = 1, EtfId = "A", EtfDate = new DateTime(2020, 6, 1) });
    datapoints.Add(new Datapoint() { Id = 2, EtfId = "A", EtfDate = new DateTime(2020, 6, 1) });
    datapoints.Add(new Datapoint() { Id = 3, EtfId = "A", EtfDate = new DateTime(2020, 6, 2) });
    datapoints.Add(new Datapoint() { Id = 4, EtfId = "B", EtfDate = new DateTime(2020, 6, 1) });
    datapoints.Add(new Datapoint() { Id = 5, EtfId = "B", EtfDate = new DateTime(2020, 6, 1) });
    var duplicates = datapoints
        .GroupBy(g => new { g.EtfId, g.EtfDate })
        .Select(g => new { g.Key.EtfId, g.Key.EtfDate, Count = g.Count() })
        .Where(g => g.Count > 1);
    var datapointsToremove = new List<int>();
    foreach (var d in duplicates)
    {
        var removeDataPoints = datapoints
            .Where(x => x.EtfId == d.EtfId && x.EtfDate == d.EtfDate).Skip(1)
            .Select(x => x.Id);
        foreach (var rd in removeDataPoints)
        {
            datapointsToremove.Add(rd);
        };
    };
    foreach (var dtr in datapointsToremove)
    {
        datapoints.Remove(datapoints.FirstOrDefault(d => d.Id == dtr));
    }
}
public class Datapoint
{
    public int Id { get; set; }
    public string EtfId { get; set; }
    public DateTime EtfDate { get; set; }
}

And then there is this inspired by var distinctItems = items.GroupBy(x => x.Id).Select(y => y.First()); see Remove duplicates in the list using linq

void Main()
{
    var datapoints = new List<Datapoint>();
    datapoints.Add(new Datapoint() { Id = 1, EtfId = "A", EtfDate = new DateTime(2020, 6, 1) });
    datapoints.Add(new Datapoint() { Id = 2, EtfId = "A", EtfDate = new DateTime(2020, 6, 1) });
    datapoints.Add(new Datapoint() { Id = 3, EtfId = "A", EtfDate = new DateTime(2020, 6, 2) });
    datapoints.Add(new Datapoint() { Id = 4, EtfId = "B", EtfDate = new DateTime(2020, 6, 1) });
    datapoints.Add(new Datapoint() { Id = 5, EtfId = "B", EtfDate = new DateTime(2020, 6, 1) });
    datapoints.Dump();
    var distinctItems = datapoints.GroupBy(x => new { x.EtfId, x.EtfDate}).Select(y => y.First());
}
class Datapoint
{
    public int Id  {get;set;}
    public string EtfId {get;set;}
    public DateTime EtfDate  {get;set;} 
}

... which is a much more elegant solution, but will do nothing to get rid of the duplicates in the database.

I believe using this approach you're removing all the entities that are duplicates. E.g.: When there is 2 duplicates, `g.Count() = 2` for all the duplicates. Therefore you're selecting them in your Where statement. — laurensvm, Jun 18 '20 at 18:14

Remove duplicate entities based on multiple properties using SQL/LINQ

2 Answers2