I would like to process a list of 50,000 URLs through a web service. The provider of this service allows 5 connections per second.

I need to process these URLs in parallel while adhering to the provider's rules.

This is my current code:

static void Main(string[] args)
{
    process_urls().GetAwaiter().GetResult();

}
public static async Task process_urls()
{
    // let's say there is a list of 50,000+ URLs
    var urls = System.IO.File.ReadAllLines("urls.txt");

    var allTasks = new List<Task>();
    var throttler = new SemaphoreSlim(initialCount: 5);

    foreach (var url in urls)
    {
        await throttler.WaitAsync();

        allTasks.Add(
            Task.Run(async () =>
            {
                try
                {
                    Console.WriteLine(String.Format("Starting {0}", url));
                    var client = new HttpClient();
                    var xml = await client.GetStringAsync(url);
                    //do some processing on xml output
                    client.Dispose();
                }
                finally
                {
                    throttler.Release();
                }
            }));
    }   
    await Task.WhenAll(allTasks);   
}

Instead of `var client = new HttpClient();` I will create an object of the target web service; `HttpClient` is used here only to keep the code generic.

Is this the correct approach to handle and process a huge list of connections? And is there any way to limit the number of established connections to 5 per second? The current implementation does not take any time frame into account.

Thanks

PyQL

1 Answer

Reading values from a web service is an I/O operation, which can be done asynchronously without multithreading.
In this case the threads do nothing except wait for a response, so running the requests on parallel threads just wastes resources.

public static async Task process_urls()
{
    var urls = System.IO.File.ReadAllLines("urls.txt");

    foreach (var urlGroup in SplitToGroupsOfFive(urls))
    {
        var tasks = new List<Task>();
        foreach (var url in urlGroup)
        {
            // Start the request without awaiting it, so the group runs concurrently
            var task = ProcessUrl(url);
            tasks.Add(task);
        }
        // This delay ensures that the next 5 URLs start no sooner than 1 second after this group
        tasks.Add(Task.Delay(1000));

        await Task.WhenAll(tasks);
    }
}

private static async Task ProcessUrl(string url)
{
    using (var client = new HttpClient())
    {
        var xml = await client.GetStringAsync(url);
        //do some processing on xml output
    }
}

private static IEnumerable<IEnumerable<string>> SplitToGroupsOfFive(IEnumerable<string> urls)
{
    const int GROUP_SIZE = 5;
    string[] group = null;
    int count = 0;

    foreach (var url in urls)
    {
        if (group == null)
            group = new string[GROUP_SIZE];

        group[count] = url;
        count++;

        if (count < GROUP_SIZE) 
            continue;

        yield return group;

        group = null;
        count = 0;
    }

    // The last group may be only partially filled, so return just the URLs that were added
    if (group != null && count > 0)
    {
        yield return group.Take(count);
    }
}

Because you mention that the "processing" of the response is also an I/O operation, the async/await approach is the most efficient: no thread is blocked while a request is in flight, and other tasks make progress while earlier ones wait for a response from the web service or for a file write to complete.
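
If waiting for a whole group to finish before starting the next one is too coarse, the `SemaphoreSlim` from the question can be combined with the `Task.Delay` idea above. The following is only a sketch under stated assumptions, not a definitive implementation: `RateLimitedRunner`, `RunAsync` and `RunOneAsync` are hypothetical names, and it assumes the provider's limit means "no more than 5 requests started in any one-second window". Each slot is held until the request has finished and the window has elapsed, so a new URL can start as soon as any single slot frees up.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class RateLimitedRunner
{
    // Runs `work` for every item, starting at most `maxPerWindow` items in any
    // period of length `window`, and never running more than `maxPerWindow` at once.
    public static async Task RunAsync<T>(
        IEnumerable<T> items, Func<T, Task> work, int maxPerWindow, TimeSpan window)
    {
        using (var throttler = new SemaphoreSlim(maxPerWindow))
        {
            var running = new List<Task>();
            foreach (var item in items)
            {
                // Wait asynchronously for a free slot; no thread is blocked here
                await throttler.WaitAsync();
                running.Add(RunOneAsync(item, work, throttler, window));
            }
            // All slots are released before the semaphore is disposed
            await Task.WhenAll(running);
        }
    }

    private static async Task RunOneAsync<T>(
        T item, Func<T, Task> work, SemaphoreSlim throttler, TimeSpan window)
    {
        try
        {
            // Keep the slot until the work is done AND the window has elapsed
            await Task.WhenAll(work(item), Task.Delay(window));
        }
        finally
        {
            throttler.Release();
        }
    }
}
```

With the methods above it could be called as `await RateLimitedRunner.RunAsync(urls, ProcessUrl, 5, TimeSpan.FromSeconds(1));`. Timer resolution means the window is approximate, so a slightly longer window may be safer if the provider enforces the limit strictly.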

Fabio
  • `Task.Delay(5000)` will make sure that the task completes after 5 seconds? What I need is to make sure that only 5 tasks run per second. And yes, the calculation is another async task that writes the output to a text file. – PyQL Dec 12 '16 at 06:26
  • `Task.Delay(1000)` was added to the collection of tasks, so `Task.WhenAll` will ensure that the next 5 URLs are processed only once at least 1 second has passed and all 5 current URLs have been processed. – Fabio Dec 12 '16 at 07:49
  • Thanks for the modification. So what is `SplitToGroupsOfFive`? – PyQL Dec 12 '16 at 09:58
  • Added the `SplitToGroupsOfFive` method implementation. – Fabio Dec 12 '16 at 10:22
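
To illustrate the point discussed in the comments: a group awaited together with a `Task.Delay` takes as long as the slower of the two, so the delay acts as a lower bound rather than a fixed duration. Below is a minimal sketch, assuming a second `Task.Delay` stands in for the web requests; `BatchTimingDemo` and `TimeBatchAsync` are hypothetical names used only for this illustration.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

public static class BatchTimingDemo
{
    // Measures how long a batch takes when its work is awaited together
    // with a Task.Delay(minimum), as in the answer's loop body.
    public static async Task<TimeSpan> TimeBatchAsync(TimeSpan workDuration, TimeSpan minimum)
    {
        var stopwatch = Stopwatch.StartNew();

        // Task.Delay(workDuration) simulates the 5 requests of one group
        await Task.WhenAll(Task.Delay(workDuration), Task.Delay(minimum));

        return stopwatch.Elapsed;
    }
}
```

Under these assumptions, work that finishes quickly still yields a batch time of roughly `minimum`, while work that runs longer than `minimum` yields a batch time of roughly the work's own duration.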