2

I have a list of URLs which I have to call and do some work on. This already works fine, but the list is very large and execution takes very long.

I think I could speed up the program by working on 5 URLs at the same time, since a large part of the execution time is probably spent waiting for the servers' responses.

I have a List of URLs

List<string> urls = getmyurls();

And then I'm iterating through them

for (int i = 0; i < urls.Count; i++)
{
    List<string> result = dosomework(urls[i]);
    urls.AddRange(result);
}

Sometimes I get some additional URLs returned, which have to be processed too.

(The code is an example; my actual program is structured a little differently. It's a minimal example to explain my problem.)

What I want is five threads running the function "dosomework" at the same time. Whenever one of them finishes, I want it to start on the next URL.

Also: How many threads would you run?

B5-NDDT
  • Try searching: Task Parallel Library. – CodeCaster Feb 04 '16 at 12:07
  • 1
    If you want to go fast, abandon your simplistic idea that more threads = more speed. Learn how to do asynchronous IO rather than blocking IO. – spender Feb 04 '16 at 12:10
  • As CodeCaster mentioned, TPL is a good place to start. It will determine for you how many tasks to run in parallel etc. And Parallel.For is very simple to put in – Sami Kuhmonen Feb 04 '16 at 12:10
  • https://msdn.microsoft.com/en-us/library/ff963547.aspx jump to parallel foreach – Jose Rocha Feb 04 '16 at 12:10
  • You definitely might want to have a look to the [TPL Dataflow](https://msdn.microsoft.com/en-us/library/hh228603(v=vs.110).aspx) – Larry Feb 05 '16 at 09:06

3 Answers

1

When you are attempting to resolve a URL and pull from the network, it's similar to pulling from the disk or reading from a database in that these are all I/O-bound operations. Going parallel is actually not desirable here, as more threads don't help but rather hinder performance. Your best bet is to utilize the async and await keywords, assuming you're on .NET 4.5.

Some people are suggesting Parallel.ForEach, but that is best suited for CPU-bound tasks. For I/O-bound tasks, you want Task.WhenAll.

There is a great video demonstration, Performing I/O-Bound Asynchronous Operations by Jeffrey Richter, and I strongly encourage watching it. In the meantime, I would write your iteration like so:

    private static IEnumerable<string> GetUrls()
    {
        return new[] { "https://stackoverflow.com/", "http://www.google.com/" };
    }

    internal async Task Fetch()
    {
        var urls = GetUrls();
        var tasks = urls.Select(DoWorkAsync);
        await Task.WhenAll(tasks);
    }

    internal Task DoWorkAsync(string url)
    {
        // TODO: Implement actual work on the URL in an async manner.
        return Task.FromResult(url);
    }  

The idea is that you get the URLs and, from each URL, select a task that executes DoWorkAsync. All of these are then awaited.

Update

It appears as though throttling has already been answered here.
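If you want to cap concurrency at five without pulling in another library, one common approach (a sketch of my own, not part of the original answer — DoWorkAsync is a stand-in for your real per-URL work, and the example.com URLs are placeholders) is to wrap each call in a SemaphoreSlim:

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public class ThrottleDemo
{
    // At most 5 DoWorkAsync calls run concurrently (the count from the question).
    private static readonly SemaphoreSlim Gate = new SemaphoreSlim(5);

    private static async Task<string> DoWorkAsync(string url)
    {
        await Task.Delay(10); // stand-in for the real network call
        return url;
    }

    private static async Task<string> ThrottledAsync(string url)
    {
        await Gate.WaitAsync(); // the 6th task waits here until a slot frees up
        try
        {
            return await DoWorkAsync(url);
        }
        finally
        {
            Gate.Release();
        }
    }

    public static async Task<string[]> RunAsync()
    {
        var urls = Enumerable.Range(0, 20).Select(i => "http://example.com/" + i);
        // All 20 tasks are created up front, but the semaphore gates execution to 5.
        return await Task.WhenAll(urls.Select(ThrottledAsync));
    }

    public static void Main()
    {
        Console.WriteLine(RunAsync().GetAwaiter().GetResult().Length);
    }
}
```

The tasks are still all started eagerly; the semaphore only gates how many are doing real work at any moment, which is usually enough for this kind of throttling.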

David Pine
  • But how do I know (or influence) how many URLs are called at the same time? It looks like the program is calling all the URLs as soon as possible. – B5-NDDT Feb 04 '16 at 12:48
  • That is a different question; I answered the original. I would imagine if the list is potentially huge, you could implement some sort of buffer or queue and then batch calls out to N per time. – David Pine Feb 04 '16 at 12:50
  • That's the original problem. The list is huge: more than 1000 URLs initially, and each task could potentially add a few more. – B5-NDDT Feb 04 '16 at 12:52
  • This seems to solve the problem, even though I didn't have time to try it out yet. Thank you. – B5-NDDT Feb 04 '16 at 13:02
  • I believe async + TPL Dataflow are the way to go. – Larry Feb 05 '16 at 09:30
0

I am a huge fan of the TPL Dataflow library. It fits this use case perfectly and is well worth learning.

Here is a raw implementation to show a bit of how it works.

var processURL = new TransformManyBlock<string, string>(url => {
    return dosomework(url);
},
new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 5 });

// Collect the output; PropagateCompletion lets us wait on the whole pipeline.
var results = new List<string>();
var collect = new ActionBlock<string>(r => results.Add(r));
processURL.LinkTo(collect, new DataflowLinkOptions { PropagateCompletion = true });

var urls = getmyurls();
foreach (var url in urls)
    processURL.Post(url);

processURL.Complete();     // signal that no more URLs will be posted
collect.Completion.Wait(); // wait until every URL has been processed

A nice example of a processing pipeline can be read here.

Larry
-1

What you are looking for is probably Parallel LINQ.

Consider the example from https://msdn.microsoft.com/pl-pl/library/dd460714(v=vs.110).aspx

EDIT: As for running on multiple threads, you add WithDegreeOfParallelism(6), where 6 is the "thread" count. It is not exactly 6 threads, but it is what you want :) Here you have a nice explanation: http://www.albahari.com/threading/part5.aspx

Also, ParallelOptions.MaxDegreeOfParallelism specifies the maximum level of parallelism.
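The PLINQ usage described above can be sketched like this (a minimal sketch of my own; DoSomeWork squaring a number is a hypothetical stand-in for the real per-URL work):

```csharp
using System;
using System.Linq;

public class PlinqDemo
{
    // Hypothetical stand-in for the actual work done per item.
    private static int DoSomeWork(int n) => n * n;

    public static int[] Run()
    {
        // At most 5 items are processed concurrently;
        // AsOrdered preserves the input order in the results.
        return Enumerable.Range(1, 10)
            .AsParallel()
            .AsOrdered()
            .WithDegreeOfParallelism(5)
            .Select(DoSomeWork)
            .ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(",", Run())); // squares of 1..10, in order
    }
}
```

Note that PLINQ consumes a fixed input sequence, so (as discussed in the comments below) it does not naturally handle work items that add new items to the queue while running.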

badsamaritan
  • But wouldn't that execute all my URLs at once? How can I determine how many parallel threads are allowed at once? – B5-NDDT Feb 04 '16 at 12:18
  • "I think I could speed up the programm by working on 5 Urls at the same time" — I understood this is what you wanted :) – badsamaritan Feb 04 '16 at 14:32
  • Maybe I don't understand it completely. I want to execute my function for all URLs, but only 5 at a time. Whenever I'm done with one, I want to start on the next. – B5-NDDT Feb 04 '16 at 15:20
  • Ok, assuming you have some actual CPU-bound tasks to do with these links, what I wrote may actually help you. Consider this example: let's say you have a big list of numbers and you want to calculate sqrt for every one. You can do this: `var querySequential = (from item in test2 select Math.Sqrt(item)).ToArray()`. To make it parallel, just change it to: `var queryParallel = (from item in test2.AsParallel().AsOrdered().WithDegreeOfParallelism(5) select Math.Sqrt(item)).ToArray()`. I've checked it, and parallel is about 3 times faster than sequential, so it works :) Of course, it is calculated 5 at a time. – badsamaritan Feb 04 '16 at 16:48
  • Thank you, this is really interesting and I will most definitely try it out, but in this special case I have the problem that the results of the tasks might add additional tasks to the queue. That will be hard to incorporate here. – B5-NDDT Feb 04 '16 at 20:58