
Goal: for each item in a list of S3 URIs, get the number of objects.

My .NET Core 3.1 console app works great when run from VS 2019, but has problems when run from cmd (or Task Scheduler, a .bat file, etc.) once the list size gets above 5,000 items or so.

Things seem okay until it gets down to around 500-1000 tasks remaining. Then, about 75% of the time, the remaining tasks never seem to complete and the app hangs forever... although the RAM usage dwindles down to just about nothing in Task Manager.

I'm fairly new to Async, and I've tried refactoring a bunch based on the myriad of solutions I see out there, but just can't seem to figure it out.

Items of note:

  • In VS, tasks seem to come back faster over time, so my first 1000 tasks might take 10s, the next 1000 take 9s, etc. Outside of VS it seems to be the opposite: they come back slower over time
  • I run this app on an AWS EC2, a t3a.2xlarge w/ 32GB of RAM
  • When I run it using PowerShell, sometimes during the run, it'll disconnect me from RDP, sometimes multiple times.
  • In VS the app uses about 75MB with a small list of URIs and about 600MB with a list of 150k. Outside of VS it uses about 4x more RAM.
  • I tried compiling as both 32-bit and 64-bit

Code:

namespace MyNamespace
{
    public class MyClass
    {
        private static DataTable dt;
        private static IAmazonS3 clientS3;

        static async Task Main(string[] args)
        {
            dt = <Call DB, get S3 URIs>;
            clientS3 = new AmazonS3Client();

            IEnumerable<Task<int>> callApiTasksQuery = from row in dt.AsEnumerable() select GetS3DataAsync(row);
            List<Task<int>> apiTasks = callApiTasksQuery.ToList();

            int total = 0;
            while (apiTasks.Any())
            {
                // if (apiTasks.Count % 100 == 0) await Console.Out.WriteLineAsync($"{apiTasks.Count} remaining.");
                Task<int> finishedTask = await Task.WhenAny(apiTasks);
                apiTasks.Remove(finishedTask);
                total += await finishedTask;
            }
        }
        
        static async Task<int> GetS3DataAsync(DataRow row)
        {
            var response = await clientS3.ListObjectsV2Async(new ListObjectsV2Request { BucketName = row[0].ToString(), Prefix = row[1].ToString() });
            // Console.WriteLine(response.S3Objects.Count().ToString());  
            return 1;
        }
    }
}
kintax
    Why DoMainAsync? Why not just put the logic from there into Main, mark it as `async`, and then eliminate the extra method? – mason Apr 01 '21 at 19:44
  • I used DoMainAsync b/c that was recommended on some other threads I found. That said, I tried removing that line entirely and going with the typical entry point of ```static async Task Main(string[] args)``` but it has no impact. – kintax Apr 01 '21 at 19:48
  • "it has no impact" - you mean except making your code cleaner? It's definitely what you should be doing. I haven't worked with AWS much, but it appears you're making a significant number of requests to an API. I would expect that to be heavily I/O bound, so you really shouldn't do more than a few at a time. And likely there's a rate limit for how often that API will allow you to hit it. – mason Apr 01 '21 at 19:51
  • The DoMainAsync syntax seems to have been an older convention. I've edited my original post, although the issue is identical regardless of entry point syntax. – kintax Apr 01 '21 at 19:59
  • Are you aware that you are bombarding the remote server with 5,000 concurrent requests? Maybe not responding is the server's defense against what appears like a DoS attack. – Theodor Zoulias Apr 01 '21 at 20:06
  • Please feel free to read the post before responding to it. Everything runs fine when run through VS. I call the API 100,000 times in 2 minutes and get all results back in 5 minutes without the cpu or ram going over 5%. Outside of VS it hangs when I try to get 3,000. – kintax Apr 01 '21 at 20:11
  • Agreed, sounds like you want to limit the amount of concurrent requests as described here: https://stackoverflow.com/questions/10806951/how-to-limit-the-amount-of-concurrent-async-i-o-operations – Jonas Høgh Apr 01 '21 at 20:14
  • I will try that, but, if that was the issue, wouldn't I have the same problems when running through VS? – kintax Apr 01 '21 at 20:21
  • It's possible that running in VS code is actually slowing it down enough (due to network latency) that it isn't running into rate limiting. Calling it from the EC2 instance might be faster in making the calls, thus hitting rate limiting. Do you have the same issue running the code locally but not in VS Code? – Jason Wadsworth Apr 01 '21 at 20:24
  • I'm not using VS Code. I'm using VS 2019. I cannot run it locally unfortunately due to authentication requirements. It runs at roughly the same speed whether run from VS or PowerShell, unless a. the initial request was for 3000+ AND b. all but about 500 tasks have completed. I know this because I've done a ton of logging which I removed from the code in my post, but it shows the stopwatch times when tasks are created and returned, etc. – kintax Apr 01 '21 at 20:28
  • I can actually run it on over 1,000,000 records if I do it from VS, and there are no issues. But when run from outside VS, whether I try to get 3,000 or 300,000, it hangs once it gets down to about 500. – kintax Apr 01 '21 at 20:47
  • BTW the AWS API will throw exceptions when you exceed the rate limit. I'm not getting any exceptions. – kintax Apr 01 '21 at 21:03
  • If you can run it in VS you can run it outside of VS. You have the credentials in VS somewhere. – Jason Wadsworth Apr 01 '21 at 21:22
  • Have you tried running it from the Visual Studio, but without the debugger attached? (Ctrl+F5) – Theodor Zoulias Apr 01 '21 at 21:40
  • @TheodorZoulias I just tried, it works and completes almost immediately, much faster than through PowerShell or VS debugger. Wonder what that means! – kintax Apr 02 '21 at 14:44
  • @JasonWadsworth sorry, I was unclear... I meant I cannot run it from my desktop. It has to be run from inside my VPC in order to access this S3 bucket. – kintax Apr 02 '21 at 14:46
  • So you have VS installed on an EC2 instance? – Jason Wadsworth Apr 02 '21 at 15:34
  • Yes. I have a t3a.2xlarge with Win 2019 and VS 2019 installed. I RDP into it and run the app from there. Running it through VS via F5 works every time with any number of tasks. Running it via Ctrl+F5 I thought worked every time (I did it a few times), but now it is behaving like when I run it through PS, launch it via a .bat, or kick it off via Task Scheduler: it fails almost every time if the number of tasks is over 5000 or so, starting to die when the number of tasks remaining is in the 500-1000 range, whether I started 5000 tasks or 50000 – kintax Apr 02 '21 at 15:49

2 Answers


The only problem I see is in this code, which operates in O(n^2) time (every Task.WhenAny call scans, and attaches a continuation to, each task still in the list):

int total = 0;
while (apiTasks.Any())
{
  // if (apiTasks.Count % 100 == 0) await Console.Out.WriteLineAsync($"{apiTasks.Count} remaining.");
  Task<int> finishedTask = await Task.WhenAny(apiTasks);
  apiTasks.Remove(finishedTask);
  total += await finishedTask;
}

If the output is not necessary, then replace this with a single Task.WhenAll:

var totals = await Task.WhenAll(apiTasks);
var total = totals.Sum();

If you do need the output, then you could reorder by completion once and then await each one. There are blog posts describing how to do that, or you can use Nito.AsyncEx:

int total = 0;
var orderedApiTasks = apiTasks.OrderByCompletion();
for (int i = 0; i != orderedApiTasks.Count; ++i)
{
  total += await orderedApiTasks[i];
  if (i % 100 == 0) await Console.Out.WriteLineAsync($"{orderedApiTasks.Count - i} remaining.");
}
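
For reference, the reordering that OrderByCompletion performs can also be sketched in plain TPL without Nito.AsyncEx. This is a minimal illustrative version of the commonly described "interleaving" technique; the OrderByCompletionSketch name is made up for the example, and it needs System.Linq, System.Threading and System.Threading.Tasks:

// Return proxy tasks that complete in the order the original tasks finish.
static List<Task<T>> OrderByCompletionSketch<T>(IEnumerable<Task<T>> tasks)
{
    var inputs = tasks.ToList();
    var proxies = inputs.Select(_ => new TaskCompletionSource<T>()).ToList();
    int nextIndex = -1;

    foreach (var task in inputs)
    {
        task.ContinueWith(completed =>
        {
            // Each completed task fills the next free proxy slot, so proxies[0]
            // finishes first, proxies[1] second, and so on.
            var proxy = proxies[Interlocked.Increment(ref nextIndex)];
            if (completed.IsFaulted)
                proxy.TrySetException(completed.Exception.InnerExceptions);
            else if (completed.IsCanceled)
                proxy.TrySetCanceled();
            else
                proxy.TrySetResult(completed.Result);
        }, CancellationToken.None, TaskContinuationOptions.ExecuteSynchronously, TaskScheduler.Default);
    }

    return proxies.Select(p => p.Task).ToList();
}

Awaiting the returned list from front to back then yields results in completion order, which is what the loop above relies on.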
Stephen Cleary

The following batched solution worked. It brings back each batch in 2-3s (~10s if run in the debugger).

Credit to https://www.michalbialecki.com/2018/04/19/how-to-send-many-requests-in-parallel-in-asp-net-core/ and thanks to all for assisting!

using System;
using System.Threading.Tasks;
using System.Collections.Generic;
using System.Data;
using Amazon.S3;
using System.Linq;
using Amazon.S3.Model;

namespace MyNamespace
{
    public class S3PrefixGrabber
    {
        private static IAmazonS3 clientS3 = new AmazonS3Client(); // create the S3 client once and reuse it for every request

        static async Task Main(string[] args)
        {
            var query = "SELECT bucket,prefix from myTable";
            DataTable dt = GetStuffFromDB(query);
            List<S3Prefix> unpopulatedList = (from DataRow dr in dt.Rows select new S3Prefix() { B = dr[0].ToString(), P = dr[1].ToString() }).ToList();

            var batchSize = 1000;
            int numberOfBatches = (int)Math.Ceiling((double)unpopulatedList.Count / batchSize);
            List<S3Prefix> populatedList = new List<S3Prefix>();

            for (int i = 0; i < numberOfBatches; i++)
            {
                var currentItems = unpopulatedList.Skip(i * batchSize).Take(batchSize);
                var tasks = currentItems.Select(id => GetS3DataAsync(id));
                populatedList.AddRange(await Task.WhenAll(tasks));
            }
        }

        static async Task<S3Prefix> GetS3DataAsync(S3Prefix s3Item)
        {
            var response = await clientS3.ListObjectsV2Async(new ListObjectsV2Request { BucketName = s3Item.B, Prefix = s3Item.P });
            s3Item.O = response.S3Objects.Count;

            return s3Item;
        }
    }

    public class S3Prefix
    {
        public string B { get; set; }  // bucket name
        public string P { get; set; }  // key prefix
        public int O { get; set; }     // object count returned for the prefix
    }
}

Running 10k records, RAM is at 75MB and CPU 40%
Running 300k records, RAM is at 700MB and CPU 40%

A snippet from the log (which I didn't include in my code above) just as an FYI:

06:32:52.310: ================= STARTING =================
06:32:52.795: Query: SELECT bucket,prefix FROM myTable
06:32:52.874: Opening Connection
06:32:54.205: Filling adapter
06:33:06.309: 313863 rows returned from DB
06:33:07.647: Batching... Batch size: 1000 Batches: 314
06:33:07.647: Starting batch 1/314... Done in 02.84s.
06:33:10.492: Starting batch 2/314... Done in 02.48s.
06:33:12.977: Starting batch 3/314... Done in 02.19s.
...
06:38:55.435: Starting batch 150/314... Done in 02.32s.
06:38:57.761: Starting batch 151/314... Done in 02.17s.
06:38:59.936: Starting batch 152/314... Done in 02.27s.
...
06:45:13.579: Starting batch 312/314... Done in 02.17s.
06:45:15.751: Starting batch 313/314... Done in 02.35s.
06:45:18.105: Starting batch 314/314... Done in 02.10s.
06:45:20.211: Writing 313863 rows to CSV... Done.
06:45:23.086: DB Rows: 313863 CSV Rows: 313863 NotInS3: 0 InS3ButNotFound: 0
06:45:23.087: Done in 12:30.77s.
06:45:23.092: ================= ENDING =================

kintax
  • Great that you got it working. If performance is important, you may want to try using e.g. ParallelForEachAsync from https://github.com/Dasync/AsyncEnumerable, to implement a sliding window of N active tasks instead of waiting for all of the N tasks to complete in every batch. – Jonas Høgh Apr 06 '21 at 07:05
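
For anyone who prefers not to add a package, here is a minimal sketch of that same sliding-window idea using only SemaphoreSlim from the BCL, reusing GetS3DataAsync and S3Prefix from the answer above; the GetAllThrottledAsync name and the maxConcurrency parameter are illustrative, and using System.Threading; must be added to the usings:

static async Task<List<S3Prefix>> GetAllThrottledAsync(List<S3Prefix> items, int maxConcurrency)
{
    using var gate = new SemaphoreSlim(maxConcurrency);

    var tasks = items.Select(async item =>
    {
        await gate.WaitAsync();   // wait for a free slot
        try
        {
            // At most maxConcurrency calls are in flight at any moment, and a new one
            // starts as soon as any other finishes, instead of waiting for a whole
            // batch to drain.
            return await GetS3DataAsync(item);
        }
        finally
        {
            gate.Release();       // free the slot for the next item
        }
    }).ToList();

    return (await Task.WhenAll(tasks)).ToList();
}

Calling var populatedList = await GetAllThrottledAsync(unpopulatedList, 1000); would then take the place of the batching loop in Main.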