I have a SQL Server table in which I would like to find "similar" rows. My first thought was to use GroupBy(), and I came up with the below:

var detections = await context.Detections
    .Where(d => d.TimeStamp > DateTime.Now.AddDays(-1)) // Could be much more logic here
    // Group detections that share the same text, hour bucket, etc.
    .GroupBy(det => new DetectionGroupByModel
    {
        DetectedText = det.DetectedText,
        HoursFromEpoch = EF.Functions.DateDiffHour(DateTimeOffset.UnixEpoch.Date, det.TimeStamp),
        // A few other columns
    })
    // For each group, keep the most recent detection plus the group size
    .Select(grp => new DetectionGroupResultsDetection
    {
        Count = grp.Count(),
        TimeStamp = grp.OrderByDescending(det => det.TimeStamp).First().TimeStamp,
        Detection = grp.OrderByDescending(det => det.TimeStamp).First(),
    })
    .OrderByDescending(det => det.TimeStamp)
    .Skip(0).Take(25)
    .ToListAsync();

This works exactly as expected, but performance becomes very underwhelming once more than a few thousand records are returned. Working in SQL Server Management Studio, I came up with the query below:

SELECT *
FROM (
  SELECT *,
    COUNT(*) OVER (PARTITION BY DATEDIFF(hour, '1970-01-01 00:00:00', TimeStamp), DetectedText, DetectedObject, DeviceId ORDER BY TimeStamp DESC) AS [Rank]
  FROM [Detections]
  WHERE TimeStamp > '2020-01-01 00:00:00' -- Could be much more logic here
) AS [DuplicationCount]
WHERE [Rank] = 1
ORDER BY TimeStamp DESC
OFFSET 0 ROWS FETCH NEXT 25 ROWS ONLY

This executes faster against more than 1,000,000 records than the LINQ above does against 10,000 records, several orders of magnitude faster. I just do not know how to reverse engineer this back into LINQ.

This does not yet give me the duplicate count the way the GroupBy() does, but it is so much faster that I could just run it again with WHERE [Rank] = 2 and do the grouping on only those results. I am fine if that loads a bit after the initial page load.
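For context, a minimal sketch of how the raw query might be run from EF Core while keeping the paging in LINQ. It assumes a hypothetical keyless entity type, here called DetectionWithRank, registered with HasNoKey() and mapped to the Detections columns plus [Rank]; none of these names exist in my actual code:

// Sketch only: DetectionWithRank is a hypothetical keyless entity (HasNoKey())
// exposing the Detections columns plus a Rank property.
var latest = await context.Set<DetectionWithRank>()
    .FromSqlRaw(@"
        SELECT *,
          COUNT(*) OVER (PARTITION BY DATEDIFF(hour, '1970-01-01 00:00:00', TimeStamp),
            DetectedText, DetectedObject, DeviceId ORDER BY TimeStamp DESC) AS [Rank]
        FROM [Detections]
        WHERE TimeStamp > {0}", DateTime.Now.AddDays(-1))
    .Where(d => d.Rank == 1)                // EF Core composes this over the raw SQL
    .OrderByDescending(d => d.TimeStamp)
    .Skip(0).Take(25)
    .ToListAsync();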

I do have complete end-to-end control of the DB and the C# code (read: one-man shop), so a redesign of the tables is not out of the question. I have been adding and benchmarking indexes on the table and getting some gains, just not anywhere near what I expected.
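For reference, one shape such an index can take through EF Core's fluent API; the exact column set here is illustrative, mirroring the query's WHERE and PARTITION BY columns:

// Illustrative composite index; the column choice mirrors the WHERE + PARTITION BY.
protected override void OnModelCreating(ModelBuilder modelBuilder)
{
    modelBuilder.Entity<Detection>()
        .HasIndex(d => new { d.TimeStamp, d.DetectedText, d.DetectedObject, d.DeviceId });
}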

I am using .NET 6.0 and the latest Entity Framework Core 6.0.0 packages.

TonySchoborg
  • Can you find out the sql that is being generated by EF? https://stackoverflow.com/questions/37527783/get-sql-code-from-an-entity-framework-core-iqueryablet (or use an SQL Profiler) – gunr2171 Dec 03 '21 at 23:37
  • Please let me know if this helps: https://pastebin.com/nrN5iUfA – Enigmativity Dec 04 '21 at 01:27
  • @Enigmativity Thanks for the reply, I get the following exception: Translation of 'Select' which contains grouping parameter without composition is not supported. I am messing with it for a little now to try and get it to work. – TonySchoborg Dec 04 '21 at 02:27
  • Could you use a raw sql or stored procedure to put your query on it? – Pato Dec 04 '21 at 12:27
  • @Pato Yeah, that is what I am working on now. It seems it may be the only way. I have some pretty complicated search capabilities, so getting that all figured out. – TonySchoborg Dec 04 '21 at 17:27
  • It's very difficult to get EF to use window functions, so I think it's probably better to just do it in SQL directly using `FromSqlRaw`. Why would you use `COUNT(*)` instead of `ROW_NUMBER`? – Charlieface Dec 05 '21 at 09:11
  • I recommend you to use SQL Server Profiler to track the query which was created from EF. After you did that you are able to analyse the execution plan. I had enormous performance issues with `varchar` columns. When I had `string` properties EF Core parsed them to `nvarchar(max)`. The SQL server then had to parse those values from `nvarchar` to `varchar` for each affected row. The solution for me was to declare the column data type in the fluent mapping. – Sebastian S. Dec 05 '21 at 11:39
  • @Charlieface I am not familiar, or know the difference, between `ROW_NUMBER` and `Count(*)`, I will take a look. Thank You. – TonySchoborg Dec 05 '21 at 16:17
  • @Sebastian S. I have been analyzing the profiler, just learned about it. There is a learning curve there, but have been able to make some optimizations based on that. I have noticed specifying some size limits had helped tremendously. I will take a look at the property mapping as you mentioned. Thank you. – TonySchoborg Dec 05 '21 at 16:22
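Following up on Charlieface's ROW_NUMBER comment above, a sketch of the substitution, kept as a C# constant for use with FromSqlRaw (the names here are illustrative):

// Sketch of the ROW_NUMBER variant. ROW_NUMBER() numbers each partition 1, 2, 3, ...
// so [Rank] = 1 keeps exactly one row per partition even if two rows tie on TimeStamp;
// the windowed COUNT(*) above is a running count and can give tied rows the same value.
const string RankedDetectionsSql = @"
    SELECT *
    FROM (
        SELECT *,
          ROW_NUMBER() OVER (PARTITION BY DATEDIFF(hour, '1970-01-01 00:00:00', TimeStamp),
            DetectedText, DetectedObject, DeviceId ORDER BY TimeStamp DESC) AS [Rank]
        FROM [Detections]
        WHERE TimeStamp > {0}
    ) AS [Ranked]
    WHERE [Rank] = 1";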
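And on Sebastian S.'s column-type comment above: if DetectedText is varchar in the database, EF Core sends string parameters as nvarchar by default, which forces a conversion on every row and can bypass the index. A sketch of the fluent mapping he describes, with a placeholder length:

// In OnModelCreating (sketch; the varchar length is a placeholder, match the real column):
modelBuilder.Entity<Detection>()
    .Property(d => d.DetectedText)
    .HasColumnType("varchar(450)");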

0 Answers