
My goal is to speed up a query, and I thought to leverage parallelism. Let's assume that I have 2,000 items in the ids list, and I split them into 4 lists of 500 ids each. I want to open 4 threads, each of which makes a DB call, and to unite their results. To achieve that I used Parallel.ForEach, but it did not improve the performance of the query, apparently because it is not well suited to IO-bound operations: Parallel execution for IO bound operations

The code in the if block uses Parallel.ForEach, vs. the code in the else block that does the same work in a regular foreach.

The problem is that the method that contains this query is not async (because it lives in a very legacy component) and it cannot be changed to async. Basically, I want to perform parallel IO-bound work inside a non-async method (via Entity Framework).

What are the best practices to achieve this goal? I saw that maybe I can use Task.WaitAll() for that. I do not mind blocking the thread that runs this query; I am more concerned that something will go wrong when Task.WaitAll() is called from a non-async method.

I use Entity Framework as the ORM over a SQL database. For each thread I open a separate context, because the context is not thread safe.

Maybe the lock that I use is the one causing the problem; I can change it to a ConcurrentDictionary.

The scenario depicted in the code below is simplified from the one I need to improve. In our real application I do need to read the related entities after I have loaded their ids, and to perform a complicated calculation on them.

Code:

// ids.Bucketize(bucketSize: 500) -> splits one big list into several lists of 500 ids each
IEnumerable<IEnumerable<long>> idsToLoad = ids.Bucketize(bucketSize: 500);
if (ShouldLoadDataInParallel())
{
    object parallelismLock = new object();
    
    Parallel.ForEach(idsToLoad,
        new ParallelOptions { MaxDegreeOfParallelism = 4 },
        (IEnumerable<long> bucket) =>
        {
            List<long> loadedIds = GetIdsQueryResult(bucket);

            lock (parallelismLock)
            {
                allLoadedIds.AddRange(loadedIds);
            }
            }
        });
}
else
{
    foreach (IEnumerable<long> bucket in idsToLoad)
    {
        List<long> loadedIds = GetIdsQueryResult(bucket);

        allLoadedIds.AddRange(loadedIds);
    }
}
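
For illustration, here is a minimal sketch of the Task.WaitAll() variant mentioned above. It assumes, as in the code above, that GetIdsQueryResult opens (and disposes) its own context per call, and it swaps the lock-protected list for a ConcurrentBag:

// Sketch: same bucketing, but with explicit tasks and Task.WaitAll().
// GetIdsQueryResult is assumed to open its own DbContext per call,
// since DbContext is not thread safe.
var results = new ConcurrentBag<long>(); // System.Collections.Concurrent

Task[] tasks = idsToLoad
    .Select(bucket => Task.Run(() =>
    {
        foreach (long id in GetIdsQueryResult(bucket))
        {
            results.Add(id);
        }
    }))
    .ToArray();

// Blocks the calling (non-async) thread until every bucket is loaded;
// exceptions from the tasks are rethrown wrapped in an AggregateException.
Task.WaitAll(tasks);
allLoadedIds.AddRange(results);

Since Task.WaitAll() simply blocks the calling thread, it can be called from a non-async method; the main risk is that failures surface as an AggregateException rather than the original exception type.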
  • You don't. You fix the data access code instead. There's no point in using EF Core or any ORM to perform a batch delete by ID. There are no objects involved in this case – Panagiotis Kanavos Dec 02 '21 at 16:11
  • Besides, performing batch operations in *parallel* will only *increase* blocking, not reduce it. You still use the same disks, same CPU, same transaction log. – Panagiotis Kanavos Dec 02 '21 at 16:13
  • the delete was just to simplify the use case; I do load the related entities and perform a few calculations with them – Avihai Sudai Dec 02 '21 at 16:15
  • The SQL you need to execute is `DELETE SomeTable where ID in (...)`. EF Core can't do that, but Dapper can. You can use Dapper on the DbContext's connection to execute eg `Execute("delete theTable where ID in @ids", new {ids=someList})`. Execute that repeatedly to delete batches without flooding the transaction log (see the sketch after these comments) – Panagiotis Kanavos Dec 02 '21 at 16:15
  • `the delete was just to simplify the use case,` in that case ask your real question and describe the real problem. The specifics matter a lot. The only thing that's certain is that "parallelizing" will only harm performance. Often exponentially – Panagiotis Kanavos Dec 02 '21 at 16:16
  • Besides, if the data comes from the database you can modify the `SELECT` query to actually delete the matching rows. There's no reason to pull data to the client only to send back a DELETE. – Panagiotis Kanavos Dec 02 '21 at 16:17
  • @PanagiotisKanavos I fixed the question – Avihai Sudai Dec 02 '21 at 16:24
  • In order to perform a lot of inserts in the DB, I think that the best practice is to use the bulk insert provided for Entity Framework: https://entityframework-extensions.net/bulk-insert. For example: when saving 5000 entities in the DB with ctx.SaveChanges the estimated duration is 6000 ms; with bulk insert the estimated time is 50 ms, a 120x reduction. – guiz Dec 02 '21 at 16:25
  • @guiz the problem is not related to writes, only to read operations; when updating the data we already use bulk insert – Avihai Sudai Dec 02 '21 at 16:26
  • Do you use multiple instances of your Entity Framework context to perform the read operations? – guiz Dec 02 '21 at 16:31
  • You still don't explain anything. BTW EF doesn't have bulk insert. And there's no bulk update or delete at all. That library actually uses SqlBulkCopy which is part of ADO.NET, not EF Core – Panagiotis Kanavos Dec 02 '21 at 16:31
  • @guiz that third-party library isn't using EF Core despite the name. EF Core doesn't have bulk insert. What that library actually does is use ADO.NET's SqlBulkCopy with a list of objects. You don't need EF Core to do this at all. – Panagiotis Kanavos Dec 02 '21 at 16:32
  • @guiz as for parallelizing, the problem isn't the context instances, it's the very idea itself. If you execute 10 commands that can block each other you cause 10! worse performance. The disks, network are the same, so you end up *reducing* performance. And `Parallel.ForEach` is *completely* unsuitable for async operations. – Panagiotis Kanavos Dec 02 '21 at 16:34
  • Thanks @PanagiotisKanavos , good to know – guiz Dec 02 '21 at 16:34
  • @AvihaiSudai you haven't posted *any database or EF code at all*. What's important is whatever is inside `GetIdsQueryResult`. `Parallel.ForEach` is meant for in-memory parallelism, not concurrent operations anyway, so your code does nothing but waste threads waiting for the database to respond. – Panagiotis Kanavos Dec 02 '21 at 16:36
  • A Table Valued Parameter, or `SqlBulkCopy` into a temp table, is probably the way forward. Once you have the data on the server you can do a joined `DELETE` – Charlieface Dec 02 '21 at 20:06
  • `assume that I have 2,000 items in ids list` that's *NO DATA* at all. You shouldn't need any kind of parallel execution for just 2K IDs. You can write a single `WHERE id IN (1,2,....2000)` clause, as the limit of the `IN` clause is in the thousands. If the query is slow, you need to fix the query. Queries that return so little data are slow because of missing indexes, inefficient queries, too much IO, too much data for the network bandwidth, or blocking. Running the same inefficient query 4 times will only make things *worse*, by splitting the resources in 4 while increasing blocking – Panagiotis Kanavos Dec 03 '21 at 07:53
  • Perhaps _TPL Dataflow_ would be a better approach, particularly for I/O-bound operations –  Dec 03 '21 at 23:44
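
To make the Dapper suggestion from the comments concrete, here is a minimal sketch. It assumes a SQL Server connection string and a hypothetical SomeTable; Dapper expands a list parameter used with `in` into the individual values:

// Sketch of the Dapper approach from the comments. "SomeTable" and
// connectionString are placeholders. Dapper rewrites "in @ids" into
// "in (@ids1, @ids2, ...)" automatically for enumerable parameters.
using (var connection = new SqlConnection(connectionString)) // System.Data.SqlClient, using Dapper;
{
    foreach (IEnumerable<long> bucket in ids.Bucketize(bucketSize: 500))
    {
        connection.Execute(
            "delete from SomeTable where ID in @ids",
            new { ids = bucket });
    }
}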

1 Answer


What are the best practices [for running multiple queries in parallel]?

Parallel.ForEach with a separate DbContext/SqlConnection per thread is a fine approach.

It's just that running your queries in parallel is not really helpful here.

If your 4 queries hit 4 separate databases, then you might get a nice improvement. But there are many reasons why running 4 separate queries in parallel on a single instance might not be faster than running a single large query, among them blocking, resource contention, server-side query parallelism, and duplicated work between the queries.

My goal is to speed up a query, and I thought to leverage parallelism

And so this is not usually a good approach to speeding up a query. There are, however, many good ways to speed up queries, so if you post a new question with the details of the query and perhaps some sample data you might get some better suggestions.
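
For scale, here is a minimal sketch of the single-query shape this usually takes with EF, assuming a hypothetical AppDbContext with a Things DbSet; Contains translates to a single IN clause (the exact SQL varies by EF version), so 2,000 ids normally need no bucketing or parallelism at all:

// Single round trip: EF translates Contains into WHERE ID IN (...).
// AppDbContext and Things are hypothetical placeholders for the real model.
using (var context = new AppDbContext())
{
    List<long> allLoadedIds = context.Things
        .Where(t => ids.Contains(t.Id))
        .Select(t => t.Id)
        .ToList();
}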

David Browne - Microsoft