
I had a problem in my code where too many open connections caused them to be closed, so I got no HTTP response. I have since refactored into something that looks like this:

List<List<string>> batches = splitListOfUrlStringsIntoBatches(urls, 50); // where 50 is the batch size

I then do:

foreach (var batchList in batches)
{
    var insertForBatch = RunBatch(batchList);
    allInsertAmounts.Add(insertForBatch);
}

and RunBatch looks like:

    private int RunBatch(IEnumerable<string> batch)
    {
        var allWriteNum = 0;
        // this will run on one bound logical thread i think
        Parallel.ForEach(batch, (batchItem) => {
             var res = Client.GetAsync(batchItem).GetAwaiter().GetResult();
             var responseBody = res.Content.ReadAsStringAsync().GetAwaiter().GetResult();
             var strongType = JsonConvert.DeserializeObject<StrongType>(responseBody);

             dbContext.Add(strongType);
             allWriteNum++;
        });
        return allWriteNum;
    }

The thing is, if I increase the batch size to something extreme like 50,000, I don't get any closed-connection errors, and now I am not sure why.

Is it because `Parallel.ForEach` has an optimisation to create the best number of tasks, and it can somehow work out that doing this would cause too many open connections, or too much CPU work?

  • `Parallel.ForEach` isn't going to spin up 50 individual components of work, it'll use however many threads are available to the scheduler by default and run, let's say, 4-5 parallel jobs at a time. It can be overridden via [MaxDegreeOfParallelism](https://learn.microsoft.com/en-us/dotnet/api/system.threading.tasks.paralleloptions.maxdegreeofparallelism?view=netframework-4.7.2#System_Threading_Tasks_ParallelOptions_MaxDegreeOfParallelism). What did your previous code look like? – ColinM Feb 19 '19 at 19:55
  • basically i had the foreach and parallel.foreach the other way around – Craig Feb 19 '19 at 19:58
  • Seems like in each thread of Parallel loop you're blocking it with GetResult, would anything change if you run it with async/await keywords to make sure threads will not wait on response and arrive back to thread pool? – mexanichp Feb 19 '19 at 19:59
  • @mexanich To be fair, I don't think blocking in this scenario is causing much of an issue because it's on a separate thread via `Parallel.ForEach` and the purpose of that thread is for requesting some HTTP data, processing, and storing in a database. In terms of threading, the only thing I'd suggest is using `Interlocked.Increment` rather than incrementing a number via `allWriteNum++` – ColinM Feb 19 '19 at 20:01
  • If I'm not mistaken, your parallel.foreach is using however many threads are available, but the code is not throwing the exception because the outer foreach is not parallel... So it will execute each batch synchronously – Stormhashe Feb 19 '19 at 20:02
  • right - so based on what you wrote Colin if I had my parralel.foreach and foreach the other way around then a thread would spin off that would be responsible for running - on that single thread - a huge batch. I think internally i had the pipeline of get, read, write as a list of tasks that I then did Task.WhenAll on which would have caused it to try and process say 50,000 tasks at once right? – Craig Feb 19 '19 at 20:04
  • @ColinM correct me if I'm wrong but isn't there a way when `Parallel.ForEach` schedules some amount of threads from thread pool where each of them is blocked by `GetResult` inside and do not return back to thread pool? If so I would assume that any new job of these 50000 tasks is just blocked by not available threads inside the thread pool and just waiting for them to get back. – mexanichp Feb 19 '19 at 20:08
  • Depending on the optimization of the code, database and `HttpClient` usage, I'd personally say your batches should be smaller and you'd be able to use one foreach with asynchrony, but that depends also on what the `HttpClient` error was. Ultimately what you want to achieve is network and disk bound I/O, which is what `async`/`await` is for. Do you really need threads to do this work? Or do you just need asynchronous execution. – ColinM Feb 19 '19 at 20:08
  • i just wanted to try and get them running in parallel. I am still pretty new to the framework and wasnt too sure if Task.WhenAll would allow the tasks to run on multiple cores - ie parralel – Craig Feb 19 '19 at 20:13
  • actually here is my original question detailing my issue, now i refactored and dont understand why increasing batch size doesnt also have the issue :D https://stackoverflow.com/questions/54733589/why-is-my-net-core-api-cancelling-requests – Craig Feb 19 '19 at 20:15
  • @mexanich although I haven't ever really dug deep into the detail of multi-threading and don't know the finer detail of `Parallel.ForEach`, I've never come across such a scenario to be able to answer, though on quick inspection of writing some `Thread.Sleep(5000)` code in a `Parallel.ForEach`, it doesn't look there are any issues with blocking multiple jobs at once, it can continue as normal. – ColinM Feb 19 '19 at 20:17
  • @ColinM alright, thanks for the explanation and brief inspection I need to take a closer look on that. – mexanichp Feb 19 '19 at 20:19
  • @Craig you really need to determine what the graceful limit of requests to the external API is, you could override `ServicePointManager.DefaultConnectionLimit` or you could come to a conclusion on how urgently this data needs to be processed. Optionally, you could redesign the code to send all of your HTTP requests first, and then handle your database work - this way you can scale to process more database requests because you know the database can handle it. – ColinM Feb 19 '19 at 20:22
  • 1
    From your previous question it appears that you've increased `DefaultConnectionLimit`. If the number of threads utilized by the parallel loop is less than that limit then why would you expect an error? The number of threads used by the parallel loop isn't specified, but it's limited. (It's not every day I see a question asking why something *does* work.) – Scott Hannen Feb 19 '19 at 20:22
  • @mjwills i didnt post the entire code or it would be too massive - i am creating db contexts on demand for this only since surely that must be thread safe. will look into Interlocked.Increment though. – Craig Feb 19 '19 at 20:46
  • @ScottHannen i increased the batch size of the http requests a parallel loop iteration fires off to well above the default connection limit and it still worked – Craig Feb 19 '19 at 20:47
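The comments above mostly come down to capping how many operations are in flight at once. As a rough, self-contained illustration (the sleep is a stand-in for an HTTP call, and the counting helpers are hypothetical), `MaxDegreeOfParallelism` bounds how many iterations `Parallel.ForEach` runs concurrently, no matter how large the batch is:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class ParallelismDemo
{
    static void Main()
    {
        var concurrent = 0;
        var maxObserved = 0;

        // Cap the loop at 4 concurrent iterations regardless of batch size
        var options = new ParallelOptions { MaxDegreeOfParallelism = 4 };

        Parallel.ForEach(new int[100], options, _ =>
        {
            var now = Interlocked.Increment(ref concurrent);
            // Track the highest concurrency level ever observed
            InterlockedMax(ref maxObserved, now);
            Thread.Sleep(10); // stand-in for the blocking HTTP call
            Interlocked.Decrement(ref concurrent);
        });

        Console.WriteLine(maxObserved <= 4); // the cap is never exceeded
    }

    // Lock-free "store value into target if it is larger" helper
    static void InterlockedMax(ref int target, int value)
    {
        int current;
        while (value > (current = Volatile.Read(ref target)))
            Interlocked.CompareExchange(ref target, value, current);
    }
}
```

Without `MaxDegreeOfParallelism` the runtime chooses the degree of parallelism itself, which is why the observed thread count stays modest even for a 50,000-item batch.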

1 Answer


You are accessing an external resource (through an HTTP client). These are I/O operations, which is exactly what async-await was designed for.

public async Task<StrongType> GetAsync(Item item)
{
    var response = await Client.GetAsync(item);
    var body = await response.Content.ReadAsStringAsync();  

    return JsonConvert.DeserializeObject<StrongType>(body);
}

public async Task Run(IEnumerable<Item> items)
{
    // Materialize the tasks so the deferred Select doesn't re-create them
    var tasks = items.Select(item => GetAsync(item)).ToList();
    var loadedStrongTypes = await Task.WhenAll(tasks);

    dbContext.AddRange(loadedStrongTypes);
}

Because the code works with an external resource, the parallel approach wastes resources by creating many threads that do nothing but wait for a response.
With async-await you can send all the requests without blocking a thread while waiting for the responses.
And when all responses have successfully arrived, you can proceed with processing the received data.
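If firing every request at once would exceed the connection limit, a middle ground between batching and all-at-once is to throttle the async calls with a `SemaphoreSlim`. This is a sketch (the delay stands in for the real `Client.GetAsync`, and the URLs are made up): all tasks are created up front, but only a fixed number are in flight at any moment:

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class ThrottleDemo
{
    // Allow at most 10 requests in flight at any one time
    static readonly SemaphoreSlim Gate = new SemaphoreSlim(10);

    static async Task<string> GetThrottledAsync(string url)
    {
        await Gate.WaitAsync();
        try
        {
            await Task.Delay(10); // stand-in for Client.GetAsync(url)
            return url;
        }
        finally
        {
            Gate.Release();
        }
    }

    static async Task Main()
    {
        var urls = Enumerable.Range(0, 100).Select(i => $"https://example.com/{i}");
        // 100 tasks exist, but the semaphore keeps only 10 running concurrently
        var results = await Task.WhenAll(urls.Select(GetThrottledAsync));
        Console.WriteLine(results.Length); // 100
    }
}
```

Unlike batching, a slow response in one "slot" doesn't hold up the other nine, so the connection limit stays respected without idle gaps between batches.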

Fabio
  • if i have 150,000 requests to make this will attempt to create many open connections though right? More than my default connection limit allows. so in that case I should feed Run a batch and await it inside a loop of batches? – Craig Feb 19 '19 at 20:52
  • If client has limit of pending requests, then you can use your approach with a batches. – Fabio Feb 19 '19 at 22:03
  • @Craig, my point was that using parallel approach(multiple threads) probably isn't correct approach for IO operations. – Fabio Feb 19 '19 at 22:04
  • i dont really understand the distinction since either way we are creating threads to do the work which will be released relatively quickly for a small batch – Craig Feb 19 '19 at 22:17
  • @Craig with IO operations async-await will use only one thread. – Fabio Feb 20 '19 at 00:20
  • Worth clarifying that the thread which async-await uses is the UI thread and as such it doesn't create any threads, it removes the need to worry about thread synchronization where, for example, you write to a `dbContext` from multiple threads in your original post. See [this answer](https://stackoverflow.com/a/37419845/5062791) for a very good explanation about async-await. – ColinM Feb 20 '19 at 08:47