0

Background

So, I am using a React frontend and a .net core 3.1 backend for a webapp where I display a view with a list of data. The list is often times several thousands long. In this case its around 7500. We virtualize it to prevent sluggishness. Along with the display of data, every row has a column with the latest logchange someone did on that datarow. The logs and the rest of the data for every row comes from two different applications with their own databases. The log data consists of the name, and date of when the log was made, is also supposed to be rendering for every row.

The problem

When you route to the page, a useEffect fires that fetches the rows from one of the databases. When I get the response, I filter out all of the ids from the data and then I post that list to the other endpoint to request the latest log from every id. This endpoint queries the logging database. The number of ids I am passing to the endpoint is about 7200+. It wont always be this much, but sometimes.

Troubleshooting

This is the query that is giving me trouble in the log endpoint

public async Task<IActionResult> GetLatestLog(ODataActionParameters parameters)
{
    var LogIds= (LogIds)parameters["LogIds"];
    var results = await context.Set<LogEvent>()
                    .Where(x => LogIds.Ids.Contains(x.Id)).ToListAsync(); //55 600 entities

    results = results
              .GroupBy(x => x.ContextId)
              .Select(x => x.OrderByDescending(p => p.CreationDate).First()).ToList(); //7 500 entities

    var transformed = results.Select(MapEntityToLogEvent).ToList();

    return Ok(transformed);
}

The first db query takes around 25 seconds (!) and returns around 56000 entities.

The second linq takes about 2 seconds, and returns around 7500 entites, and the mapping takes around 1 second.

The database is SQL server, and there are three indexes, one of which is Id, the other two are irrelevant for this assignment.

I have tried different queries, AsNoTracking, but to no avail.

Obviously this is horrible. Do you know of a way to optimize this query?

Svyatoslav Danyliv
  • 21,911
  • 3
  • 16
  • 32
  • How many `LogIds.Ids` are there in your incoming query for filtering? If this is a small number or limited by the application then we need to look elsewhere for improvements like how broad are the returned entities and are there proper indexes setup. – Igor Jan 03 '23 at 14:57
  • 2
    You should run a profiler to see where the time is taken – Daniel A. White Jan 03 '23 at 14:59
  • Not 100% sure, but you _may_ be reaching numbers, where it becomes dangerous to use `.Contains` in the query. – Fildor Jan 03 '23 at 14:59
  • 3
    Also you shouldn’t do ToListAsync until you do all the query work – Daniel A. White Jan 03 '23 at 15:00
  • 1
    @DanielA.White - my guess is they ran into a runtime exception with `GroupBy` which is common in EFCore if you do not understand the limitations and how to use GroupBy so they "fixed" it by using `ToListAsync` before. – Igor Jan 03 '23 at 15:01
  • 1
    If nothing else you can also try adding `AsNoTracking()` before calling `ToListAsync()`. That might speed up the results being materialized but without more information this is just a shot in the dark and likely is not a "good fix". – Igor Jan 03 '23 at 15:06
  • I'd probably create a View for "Newest Entry per ContextId". That should make things _much_ easier (and probably faster). – Fildor Jan 03 '23 at 15:07
  • @Igor There are about 7200+ separate id's, as can be seen also in the number of entities left in the second linq after filtering for the latest creationdate. – swedish_junior_dev Jan 03 '23 at 15:13
  • Can you filter for latest creation date _first_? Do you have indexes on the table? – Fildor Jan 03 '23 at 15:14
  • @DanielA.White I know of the Profiler, but havent used it before, I will try it first thing in the morning and comeback! Regarding ToListAsync I followed the style of similiar endpoints in the application, not sure why its runt there, really. I will try moving to the second query. – swedish_junior_dev Jan 03 '23 at 15:15
  • @Igor I will try AsNoTracking and comeback asap! – swedish_junior_dev Jan 03 '23 at 15:15
  • @Fildor there are indexes for other properties, like Id, but not for creation date. Maybe I could create one, I will think about it! – swedish_junior_dev Jan 03 '23 at 15:17
  • @Igor AsNoTracking did nothing, unfortunately – swedish_junior_dev Jan 03 '23 at 15:21
  • That is a lot of ids you are passing in. What is your RDBMS (sql server, oracle, mysql, etc)? – Igor Jan 03 '23 at 15:21
  • @Igor Sql server.. I know, it wont always be that much, but I thought , if the user dont mind waiting a few seconds its okay, but 30+ seconds is insane. – swedish_junior_dev Jan 03 '23 at 15:23
  • Start by checking your EventLog.Id index. Is this a clustered index? If not be sure to add the other columns in the table as Include to the index. Is there fragmentation on the index? If so rebuild it. Next capture the query plan and the IO statistics, you can use Sql Profiler for this. If the selected query plan is as good as it can get then can you limit the result set? You can build in paging which is fairly simple with Odata (which you are using) and limit the result set to the page size. If paging is not an option then you need to refactor and use TVP instead of IN/Contains. – Igor Jan 03 '23 at 15:28
  • It can be improved by third party extension. But unfortunately question is closed and I cannot post answer. – Svyatoslav Danyliv Jan 03 '23 at 16:24
  • In case you are not aware. The first statement with the ToListAsync is the only DB query, this materializes the data from the database. The subsequent statements are all executed in memory. The only optimization possible there is if `ContextId` is a string then you can specify the type of string comparison to use which can save a second or 2. – Igor Jan 05 '23 at 13:34
  • Ideally though you want to combine the first 2 statements and possibly the 3rd as well but that is hard to tell because you did not include code for the projection. The more that can be executed DB side which results in less data being piped back to the application the better. – Igor Jan 05 '23 at 13:35
  • Also if this is for a user there is no way I would ever need to visually sort / compare a result set of thousands of records.... ever. What I would expect as a user is the ability to filter, sort, and page the results. Then the user can view a page of 100 or so records at a time. That would significantly reduce the cost of data retrieval and keep your application responsive to the users input. OData has built in mechanisms for paging that you can take advantage of. – Igor Jan 05 '23 at 13:37
  • Finally you never addressed the questions I had about index structure and query profiling. Profile the incoming query and see if the issue is the number of parameters being passed or the index being used. See [include the actual Execution Plan](https://stackoverflow.com/a/7359705/1260204), you could use [Paste the Plan](https://www.brentozar.com/pastetheplan/) and share the link in your question. – Igor Jan 05 '23 at 13:39
  • @Igor, there are two ways how to improve this query. 1) rewrite to make grouping translatable (that's why you see ToList at the start) 2) use third party extension which can handle more than thousand ids for `Contains` via temporary table. I vote for reopen this question. – Svyatoslav Danyliv Jan 05 '23 at 14:11
  • I think you can speed up the query, but I would argue that real question is do you really need all the 7500+ records delivered to the frontend at the same time? – Guru Stron Jan 06 '23 at 06:49

1 Answers1

0

There are two ways, how to improve your query:

Pure EF Core

We can rewrite LINQ query to be translatable and avoid unnecessary records on the client side. Note that your GroupBy will work with EF Core 6:

public async Task<IActionResult> GetLatestLog(ODataActionParameters parameters)
{
    var LogIds = (LogIds)parameters["LogIds"];
    var results = context.Set<LogEvent>()
       .Where(x => LogIds.Ids.Contains(x.Id)); 

    results = 
        from d in results.Select(d => new { d.ContextId }).Distinct()
        from r in results
            .Where(r => r.ContextId == d.ContextId)
            .OrderByDescending(r => r.CreationDate)
            .Take(1)
        select r;

    var transformed = await results.Select(MapEntityToLogEvent).ToListAsync();

    return Ok(transformed);
}

Using third party extension

With linq2db.EntityFrameworkCore we can use full power of the SQL and make most efficient query in this case. Big list of ids can fast be copied to temorary table and used in result query. Retrieveing only latest records by ContextId can be done effectively with Windows Function ROW_NUMBER.

Disclaimer I'm maintainer of this library.

// helper class for creating temporary table
class IdsTable
{
    public int Id { get; set; }
}

public async Task<IActionResult> GetLatestLog(ODataActionParameters parameters)
{
    var LogIds = (LogIds)parameters["LogIds"];
    using var db = context.CreateLinqToDBConnection();
    TempTable<IdsTable>? idsTable = null;

    var results = context.Set<LogEvent>().AsQueryable();

    try
    {
        // avoid using temporary table for small amount of Ids
        if (LogIds.Ids.Count() < 20)
        {
            results = results.Where(x => LogIds.Ids.Contains(x.Id));
        }
        else
        {
            // initializing temporary table
            idsTable = await db.CreateTampTableAsync(LogIds.Ids.Select(id => new IdsTable { Id = id }, tableName: "temporaryIds"));
            
            // filter via join
            results =
                from t in idsTable
                join r in results on t.Id equals r.Id
                select r;
        }

        // selecting last log
        results = 
            from r in results
            select new 
            {
                r,
                rn = Sql.Ext.RowNumber().Over()
                    .PartitionBy(r.ContextId)
                    .OrderByDesc(r.CreationDate) 
                    .ToValue()
            } into s
            where s.rn == 1
            select s.r;

        var transformed = await results
            .Select(MapEntityToLogEvent)
            .ToListAsyncLinqToDB(); // we have to use our extension because of name collision with EF Core extensions

    }
    finally
    {
        // dropping temporaty table if it was used
        idsTable?.Dispose();
    }

    return Ok(transformed);
}

Warning

Also note that logs count will grow and you have to limit result set by date and probably count of retrieved records.

Svyatoslav Danyliv
  • 21,911
  • 3
  • 16
  • 32