
I'm working on improving the performance of a LINQ filter against a large collection of POCOs, but local testing indicates a CPU bottleneck.

I was initially trying this to reduce the load on a SQL Server: retrieve a large result set once, load it into memory on a separate processing server, then filter that result set in .NET.

Here is the demo code:

public class CustomClass
{
    public int Id { get; set; }
    public int OtherId { get; set; }
    public DateTime Date { get; set; }
}

public void DoStuff()
{        
    // approx 800,000 items
    List<CustomClass> allItems = _repo.GetCustomClassItemsFromDatabase();

    foreach (OtherCustomClass foo in _bar)
    {
        // original linq-to-entities query,
        // get most recent Ids that apply to OtherId
        List<CustomClass> filteredItems = (
            from item in allItems
            where item.OtherId == foo.OtherId && item.Date <= foo.Date
            group item by item.Id into groupItems
            select groupItems.OrderByDescending(i => i.Date).First()).ToList();

        DoOtherStuff(filteredItems);
    }
}

This spikes my 4 cores to 100% CPU for 1m30s, which is not feasible for a production system. I ran the performance analyzer in VS2012, and 30% of the time is in the getter for item.OtherId.
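For anyone who wants to reproduce the timing without the profiler, a simple Stopwatch harness around the method does the trick (this is illustrative, not the exact test I ran):

var sw = System.Diagnostics.Stopwatch.StartNew();
DoStuff(); // the method above: ~1m30s with the in-memory LINQ filter
sw.Stop();
Console.WriteLine("Elapsed: {0}", sw.Elapsed);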

I started rewriting the LINQ as plain code to see if I could get any speed improvement, but so far I have had no luck. Here is the plain-code rewrite:

private List<CustomClass> FilterCustomClassByIdAndDate(
    List<CustomClass> items, int otherId, DateTime date)
{
    var mostRecentCustomClass = new Dictionary<int, CustomClass>();

    foreach (CustomClass item in items)
    {
        // match the original query: same OtherId, on or before the cutoff date
        if (item.OtherId != otherId || item.Date > date) { continue; }

        CustomClass mostRecent;
        if (mostRecentCustomClass.TryGetValue(item.Id, out mostRecent) &&
            mostRecent.Date >= item.Date) 
        { continue; }

        mostRecentCustomClass[item.Id] = item;
    }

    var filteredItems = new List<CustomClass>();

    foreach (KeyValuePair<int, CustomClass> pair in mostRecentCustomClass)
    {
        filteredItems.Add(pair.Value);
    }

    return filteredItems;
}
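For completeness, this helper replaces the LINQ query inside the loop, roughly like this (simplified):

foreach (OtherCustomClass foo in _bar)
{
    // still one full scan of allItems per foo
    List<CustomClass> filteredItems =
        FilterCustomClassByIdAndDate(allItems, foo.OtherId, foo.Date);
    DoOtherStuff(filteredItems);
}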

This is still hitting 100% CPU, with 30% of the time in the item.OtherId getter. Has anyone had a similar issue in the past, or does anyone have an idea how to improve this?

Edit: Code showing a massive improvement

Thanks to @FastAl, this code ran through the _bar -> DoOtherStuff(filteredItems) loop in under a second:

public void DoStuff()
{        
    // approx 800,000 items
    List<CustomClass> allItems = _repo.GetCustomClassItemsFromDatabase();

    var indexedItems = new Dictionary<int, List<CustomClass>>();

    foreach (CustomClass item in allItems)
    {
        List<CustomClass> allByOtherId;

        if (!indexedItems.TryGetValue(item.OtherId, out allByOtherId)) 
        {
            allByOtherId = new List<CustomClass>();
            indexedItems[item.OtherId] = allByOtherId;
        }

        allByOtherId.Add(item);
    }

    foreach (OtherCustomClass foo in _bar)
    {
        List<CustomClass> filteredItems;

        if (!indexedItems.ContainsKey(foo.OtherId))
        {
            filteredItems = new List<CustomClass>();
        }
        else
        {
            filteredItems = (
                from item in indexedItems[foo.OtherId]
                where item.Date <= foo.Date
                group item by item.Id into groupItems
                select groupItems.OrderByDescending(i => i.Date).First())
                .ToList();
        }

        DoOtherStuff(filteredItems);
    }
}
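A possible further tweak (untested sketch, combining the index above with the per-Id dictionary from my earlier plain-code rewrite) would avoid the group/OrderByDescending sort entirely:

foreach (OtherCustomClass foo in _bar)
{
    var mostRecent = new Dictionary<int, CustomClass>();
    List<CustomClass> itemsForOtherId;

    if (indexedItems.TryGetValue(foo.OtherId, out itemsForOtherId))
    {
        // single pass: keep the newest item per Id on or before foo.Date
        foreach (CustomClass item in itemsForOtherId)
        {
            if (item.Date > foo.Date) { continue; }

            CustomClass current;
            if (!mostRecent.TryGetValue(item.Id, out current) ||
                item.Date > current.Date)
            {
                mostRecent[item.Id] = item;
            }
        }
    }

    DoOtherStuff(new List<CustomClass>(mostRecent.Values));
}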
Laurence
    Are you loading 800,000 items into memory? According to your example you have only one table, but is that true? – Dilshod May 07 '13 at 18:23
    I know this isn't an answer to your exact question, but is there any way you can offload this to a stored procedure on the SQL server? I've noticed huge performance gains by moving my problematic LINQ to a stored procedure and creating the appropriate indexes. – Matt Johnson May 07 '13 at 18:24
    If you are using a database, why are you fetching all the data into a list and then filtering the list? The entire point of linq was to make it easy to filter ***at the database***. The main problem here seems to be the use of lists. – Marc Gravell May 07 '13 at 18:24
  • @Matt for the record, the benefit there is rarely "stored procedures" - these days there is rarely any noticeable perf difference between stored procs and parameterized TSQL. A regular TSQL query does the job very nicely. – Marc Gravell May 07 '13 at 18:26
  • You should not sort your data every time your application starts, or whenever you need the most recent CustomClass object. Use indexed views in SQL to get rid of the memory and time issues. – Dilshod May 07 '13 at 18:31
    Filtering efficiently is what databases are for; take advantage of that. I doubt you're doing the database server any favors by making it serve out all 800,000 records instead of just the handful you're actually interested in. My experience is that I/O and network operations create database bottlenecks far more often than CPU usage. – Jeremy Todd May 07 '13 at 18:33
  • There is a sproc that I wrote to do this in the database, and right now it is faster on my test environment with 1 user. However, the performance info coming back from the DMVs in the database are showing that sproc getting called over 6,000 times and doing a ludicrous number of reads. The issue is that a lot of the data for the filter is duplicated, which is why I wanted to load it once. Indexing has also helped, but I was hoping to get closer to acceptable performance so I can put this process on commodity hardware for scaling. – Laurence May 07 '13 at 18:36
  • @MattJohnson, Marc is right, generally after the first call the tsql plan is cached and readily available to the sql query optimizer. – Laurence May 07 '13 at 18:38
  • @Laurence I assume using indexed view will solve your problem. – Dilshod May 07 '13 at 18:39
  • @Dilshod I'm loading them into memory to save multiple enumerations of IEnumerable or similar. The indexed view will not be that useful since all the data comes from 1 table that already has comprehensive indexes. – Laurence May 07 '13 at 18:57
  • It sounds to me like your approach is simply bad. Am I correct that you're trying to filter a large list of items for each item in another large list of items? Do you need to filter the entire set at once? Can you pre-generate a many-to-many filter join table? There are lots of different approaches that would avoid the iterative process you're using. – Erik Funkenbusch May 07 '13 at 19:14
  • @Laurence I don't know your server usage, but with 800,000 rows are you sure nothing comes into your DB that invalidates your execution plan? It is possible for LINQ to be as fast as, but not faster than, correctly written sprocs and optimized databases. There are just too many knobs you can turn server-side to optimize sprocs that allow a good db to outperform any type of automatic tuning. That is why I suggested it. http://stackoverflow.com/questions/752064/performance-difference-between-linq-and-stored-procedures – Matt Johnson May 07 '13 at 19:29
  • @MystereMan You are correct, I'm filtering a large list [A] for each item in a smaller (<20,000 records) list [B]. The database naturally does this better than `.Net` but is reading a large portion of the table/index for each item in [B], and returning much of the same information. I was hoping to offload the processing. – Laurence May 07 '13 at 19:29
  • @MattJohnson: I agree, the database is better at filtering. However, I am getting a lot of the same information multiple times when using the single sproc, and was hoping to reduce that while still retaining decent performance. Maybe I should remove the database from the question? @Jeremy Todd: Unfortunately at the point of this process I know I need all of those 800,000 items for at least 1 record in the `foo` list, and the database would have to serve up at least 800,000 records cumulatively for the individual calls. – Laurence May 07 '13 at 19:39
  • @Laurence - are records 1:many? or many:many? if 1:many, you could reduce further filters by removing items from the list. You could also break out if you know there are no more records for the key. There are literally tons of different optimizations you could do if you're going to do it in code. You could also pre-process all the filters and create an index table which you can use to bypass the large sort. – Erik Funkenbusch May 07 '13 at 20:05

1 Answer


Use a dictionary of lists.

After you load your items, loop through them once to build a dictionary of lists. Note the inserted loop and the change in the where clause.

Please forgive any errors, I only had 4 minutes ;-) Learn to love the dictionary. It's wicked fast - it uses one of the fastest search/insert structures there is (a hash table). Really a wonderful little tool from M$.
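In complexity terms: Dictionary<TKey, TValue> lookups via TryGetValue are effectively O(1), versus the O(n) scan the original where clause does over all 800,000 items for every foo. A trivial illustration (using index1 from the code below; the key 42 is made up):

// linear search: walks the whole list on every lookup
CustomClass slow = allItems.FirstOrDefault(i => i.OtherId == 42);

// hash lookup: jumps straight to the matching bucket
List<CustomClass> fast;
index1.TryGetValue(42, out fast);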

My honest suggestion - do it on the database. Ask yourself - did you try it there? I've been at this a while and I still can never tell which of two unknowns will be faster without actually testing it first (unless it's really obvious, but you wouldn't have posted this here if it were). Double-check that the DB has an index on OtherId, or else it's facing the same problem your LINQ statement is (a linear search).
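If you do test it on the database, make sure the filter runs server-side so only matching rows come over the wire. A rough sketch against a hypothetical EF context (the context variable and CustomClasses set are assumptions; note that LINQ to Entities wants FirstOrDefault() rather than First() inside the projection):

// translated to SQL and executed on the server
List<CustomClass> filteredItems = (
    from item in context.CustomClasses
    where item.OtherId == foo.OtherId && item.Date <= foo.Date
    group item by item.Id into groupItems
    select groupItems.OrderByDescending(i => i.Date).FirstOrDefault())
    .ToList();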

public class CustomClass
{
    public int Id { get; set; }
    public int OtherId { get; set; }
    public DateTime Date { get; set; }
}

public void DoStuff()
{        
    // approx 800,000 items
    List<CustomClass> allItems = _repo.GetCustomClassItemsFromDatabase();
    var index1 = new Dictionary<int, List<CustomClass>>();

    // build the index: one list of items per OtherId
    foreach (CustomClass foo1 in allItems)
    {
        List<CustomClass> allOtherIDs;
        if (!index1.TryGetValue(foo1.OtherId, out allOtherIDs))
        {
            allOtherIDs = new List<CustomClass>();
            index1.Add(foo1.OtherId, allOtherIDs);
        }
        allOtherIDs.Add(foo1);
    }


    foreach (OtherCustomClass foo in _bar)
    {
        // original linq-to-entities query,
        // get most recent Ids that apply to OtherId
        List<CustomClass> filteredItems = (
            from item in index1[foo.OtherId]
            where item.Date <= foo.Date
            group item by item.Id into groupItems
            select groupItems.OrderByDescending(i => i.Date).First()).ToList();

        DoOtherStuff(filteredItems);
    }
}
FastAl
    Okay, wow. I made this change and for my 800,000 records it went from 1m30s at 100% CPU to about 1 second at 90% CPU on one core. Dictionary is great! The database version is still marginally faster overall but only due to the initial retrieval of the records from the DB; it actually sped up the processing significantly. – Laurence May 07 '13 at 20:36
  • Great to hear! Now you know why I'm so enthusiastic about it! – FastAl May 09 '13 at 13:44