0

I have a very long running operation that could be split into many shorter ones - if I was able to call DbContextasynchronously.

First of all, a data service:

public class MyDataService : IMyDataService
{
    private readonly IMyRepository _arepository;
    private readonly ITransactionScope _txScope;

    public MyDataService(IMyRepository repository, ITransactionScope txScope)
    {
        _repository = repository;
        _txScope = txScope;
    }

    public async Task<Result> CreateOrAppendAsync(SomeObject[] someObjects)
    {
        return await _txScope.DoInTransactionAsync(
            _principal.GetName(),
            "create-or-append",
            null,
            async () => await _repository.CreateOrAppendAsync(someObjects));
    }
}

And then the repository:

public class MyRepository : IRepository
{
    private readonly InternalDbContext _dbContext;

    public Repository(InternalDbContext dbContext)
    {
        _dbContext = dbContext;
    }

    public async Task<Result> CreateOrAppendAsync(SomeObject[] someObjects)
    {
        var tasks = new List<Task<PartOfResult>>();
        foreach (var batch in someObjects.Split(10))
        {

            tasks.Add(Task.Run(() =>
            {
                // do something with _dbContext to get result
            }));
        }
        var results = await Task.WhenAll(tasks.ToArray());
        return results.Combine();
    }
}

This of course does not work because of:

A second operation was started on this context instance before a previous operation completed. This is usually caused by different threads concurrently using the same instance of DbContext. For more information on how to avoid threading issues with DbContext, see https://go.microsoft.com/fwlink/?linkid=2097913.

The question is: How would I do this correctly? Previously I've just copied the DbContext, but since all of this is part of the same transaction, that is not possible. The Postgres database should be able to handle multiple calls, Dotnet has such nice keywords to make parallel processing easy, but how does EF Core work together with all of this.

Stefan S.
  • 3,950
  • 5
  • 25
  • 77
  • `DbContext` is a unit-of-work and it should always be short-lived (i.e. not live longer than absolutely necessary, e.g. in ASP.NET it should be request-scoped; or in a more traditional batch-process it should be (akin to being) scoped to a single method and not stored in a field. – Dai Jun 15 '23 at 09:45
  • Where is `DoInTransactionAsync` defined? Are you certain you're using a "_real_" SQL `TRANSACTION` here - or just EF's _ersatz_ transactions? (fun-fact: it's easy to mess-up your DB with EF transactions unless you're very, very careful with how all of your rollbacks work, also consider your transaction-isolation level too). – Dai Jun 15 '23 at 09:47
  • Use two `DbContext` and than wrap the two call in a `TransactionScope` with `TransactionScopeAsyncFlowOption.Enabled` might work. – Magnus Jun 15 '23 at 13:39
  • @Dai My bad, I didn't see that the transaction was some third-party class. But it uses the transaction off the `DbContext`. Sooo ersatz transaction? Can I use database transaction directly? – Stefan S. Jun 16 '23 at 05:51

2 Answers2

1

You are trying to run queries not asynchronously, but concurrently. These are completely different things. If you want to run it asynchronously you just remove the Task.Run-part

public async Task<Result> CreateOrAppendAsync(SomeObject[] someObjects)
{
      var tasks = new List<Task<PartOfResult>>();
      foreach (var batch in someObjects.Split(10))
      {
          await CreateOrAppendAsync(batch);
      }
}

The point here is that the thread should not block while you are doing database operations.

If you want to run queries concurrently you need to create multiple dbContext objects, there is just no getting around this. Running concurrent queries might or might not improve anything, depending on the database and the queries.

JonasH
  • 28,608
  • 2
  • 10
  • 23
  • The question is: when I have multiple dbContext objects, how do I handle a transaction between them? – Stefan S. Jun 15 '23 at 08:51
  • @StefanS. Have a transaction for each and then you could manually rollback when one of them fails. – Silvermind Jun 15 '23 at 08:58
  • 2
    @StefanS. I'm no database expert, but I'm fairly sure you don't, and that transactions are inherently serial things. But you need to ask yourself what the real purpose is, ensuring UI is responsive? Then async is the way to go. Total runtime? This has more to do with database and query design. I would not expect batching to improve overall performance, since the db can already use concurrency internally. – JonasH Jun 15 '23 at 09:03
  • @JonasH It's the total runtime. For each `SomeObject`, there is a big SELECT to question whether the object should be INSERTed or UPDATEd. And then to build the entity in EF core, there are multiple calls to the database necessary. I imagine all the reading of the database could be done in parallel, just the final inserting / updating has to be done one after the other. – Stefan S. Jun 16 '23 at 05:32
  • @StefanS. The total amount of work will not change just because you are doing things in parallel. You might mean total *latency*. Regardless, you might want consider writing manual SQL for performance sensitive parts. EF is convenient, but does not always produce great SQL. That is likely better than trying to shoehorn in concurrency where it does not fit. Insert or Update should be possible to do with a single query, see [INSERT OR UPDATE on SQL Server](https://stackoverflow.com/questions/108403/solutions-for-insert-or-update-on-sql-server) – JonasH Jun 16 '23 at 07:51
-3

Based on my understanding of your question, it seems that you need to perform a large number of operations in sequence, and because you don't want to block main thread and make it responsive. To ensure data consistency, it is recommended that you use a distributed transaction coordinator.

mostafa
  • 261
  • 1
  • 2
  • 10
  • The OP does not seem to be doing anything that requires distributed transactions. – Dai Jun 15 '23 at 09:48
  • @Dai distributed means that you want to do some operation in different locations(threads, process, time, server or any thing else), based on this question he wants to do a long-running operation in different threads in different times – mostafa Jun 15 '23 at 09:50
  • I think you're inferring too much from the OP's post and coming to an unsupported conclusion; and besides, in-proc work (which is what the OP is doing) is the very-opposite of a distributed system. – Dai Jun 15 '23 at 09:52
  • @Dai As he asked, the task is long-running, so he decided to run it in parallel, BUT THE MAIN REASON FOR ASKING THIS QUESTION IS THAT EVERY TASK RUN IN A DIFFERENT THREAD, but dbContext is not thread-safe, but why he needs to break it to smaller tasks and run in the same transaction with the same dbContext instance? because he wants to guaranty data consistency – mostafa Jun 15 '23 at 09:58
  • Using a distributed transaction coordinator won't help with the lack of thread-safety in `DbContext` either. – Dai Jun 15 '23 at 09:59
  • _"BUT THE MAIN REASON FOR ASKING THIS QUESTION IS THAT EVERY TASK RUN IN A DIFFERENT THREAD"_ - I might suggest it's time to take a 5-minute breather from your computer... – Dai Jun 15 '23 at 09:59
  • @Dai **tasks.Add(Task.Run(() =>** If someone attempts to perform operations on data in parallel, what would be the appropriate way to handle such a scenario? – mostafa Jun 15 '23 at 10:02
  • _"If someone attempts to perform operations on data in parallel, what would be the appropriate way to handle such a scenario? "_ - there is no short simple answer for this - not even a rule-of-thumb because so much depends on exactly what the "operation" is - but right-away I will say that EF _should not be used_ because EF is not intended for high-performance or long-running scenarios: instead, I'd suggest use ADO.NET directly (which also means you can use MARS to support true concurrent queries). If it's something that can be done entirely within the DB server then ditch C#, etc. – Dai Jun 15 '23 at 10:05
  • @Dai now we have a scenario that Stephan needs to run a long-running task, he decided to break it into smaller tasks and run them in different threads to speed up the operation, he needs to open a transaction to make sure data consistency is preserved. this is one of the places where distributed transaction can be used – mostafa Jun 15 '23 at 10:10
  • Again, I feel you're inferring too much, especially this: _"he decided to break it into smaller tasks and run them in different threads to speed up the operation"_ - that isn't my reading of his post at all and he never mentions performance problems at all and there are many reasons for splitting-up an operation/transaction into smaller units-of-work that don't involve performance. – Dai Jun 15 '23 at 10:14
  • I think @mostafa is right, this is mostly a performance problem, because there are a lot of heavy read operations (which I assume could be done in parallel) necessary before a single write operation per object is done. I'll take a look at distributed transactions, thanks :) – Stefan S. Jun 16 '23 at 05:41
  • @StefanS. Also, try to use bulk operation for data updates, and row version timestamp – mostafa Jun 17 '23 at 09:21