I need to use proxies to download a forum. The problem with my code is that it takes only 10% of my internet bandwidth. Also I have read that I need to use a single HttpClient instance, but with multiple proxies I don't know how to do it. Changing MaxDegreeOfParallelism doesn't change anything.

public static IAsyncEnumerable<IFetchResult> FetchInParallelAsync(
    this IEnumerable<Url> urls, FetchContext context)
{
    var fetchBlcock = new TransformBlock<Url, IFetchResult>(
        transform: url => url.FetchAsync(context), 
        dataflowBlockOptions: new ExecutionDataflowBlockOptions 
        {
            MaxDegreeOfParallelism = 128
        }
    );
    foreach(var url in urls)
        fetchBlcock.Post(url);

    fetchBlcock.Complete();
    var result = fetchBlcock.ToAsyncEnumerable();
    return result;
}

Every call to FetchAsync will create or reuse a HttpClient with a WebProxy.

public static async Task<IFetchResult> FetchAsync(this Url url, FetchContext context)
{
    var httpClient = context.ProxyPool.Rent();
    var result = await url.FetchAsync(httpClient, context.Observer, context.Delay,
        context.isReloadWithCookie);
    context.ProxyPool.Return(httpClient);
    return result;
}

public HttpClient Rent() 
{
    lock(_lockObject)
    {
        if (_uninitiliazedDatacenterProxiesAddresses.Count != 0)
        {
            var proxyAddress = _uninitiliazedDatacenterProxiesAddresses.Pop();
            return proxyAddress.GetWebProxy(DataCenterProxiesCredentials).GetHttpClient();
        }
        return _proxiesQueue.Dequeue();
    }
}

I am a novice at software development, but downloading asynchronously through hundreds or thousands of proxies looks like a common task that many people must have faced and solved correctly. So far I have been unable to find a solution to my problem on the internet. Any thoughts on how to achieve maximum download speed?

Theodor Zoulias
Max
  • Take a look at this: [What is HttpClient's default maximum connections](https://stackoverflow.com/questions/31735569/what-is-httpclients-default-maximum-connections) – Theodor Zoulias Oct 31 '20 at 08:23
  • Check whether appending `.ConfigureAwait(false)` to the end of `await url.FetchAsync(httpClient, context.Observer, context.Delay, context.isReloadWithCookie);` helps. – Seabizkit Oct 31 '20 at 10:03
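
As an illustration of what the two comments above point at, here is a minimal sketch, assuming one `HttpClient` per proxy instead of a single shared instance, because a handler's settings (including its proxy) cannot be changed once it has sent its first request. `ProxyClientFactory`, `CreateHandler`, `Create`, and the `maxConnections` default of 16 are hypothetical names and values, not part of the question's code. `MaxConnectionsPerServer` is the per-handler connection limit; on .NET Framework the older `ServicePointManager.DefaultConnectionLimit` (2 by default in client applications) may also apply.

```csharp
using System;
using System.Net;
using System.Net.Http;

// Hypothetical helper: builds one HttpClient per proxy address.
public static class ProxyClientFactory
{
    // Creates a handler bound to a single proxy with an explicit connection limit.
    public static HttpClientHandler CreateHandler(
        Uri proxyAddress, ICredentials credentials, int maxConnections = 16)
    {
        return new HttpClientHandler
        {
            Proxy = new WebProxy(proxyAddress) { Credentials = credentials },
            UseProxy = true,
            // Upper bound on concurrent connections per server for this handler.
            MaxConnectionsPerServer = maxConnections
        };
    }

    // The client owns the handler and disposes it together with itself.
    public static HttpClient Create(
        Uri proxyAddress, ICredentials credentials, int maxConnections = 16)
    {
        return new HttpClient(
            CreateHandler(proxyAddress, credentials, maxConnections),
            disposeHandler: true);
    }
}
```

Each client created this way can be kept in the pool and reused for the lifetime of the download. As for `.ConfigureAwait(false)`: it only avoids resuming on a captured synchronization context, so in a console application, which has none by default, it will probably make no measurable difference.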

1 Answer

Let's take a look at what happens here:

var result = await url.FetchAsync(httpClient, context.Observer, context.Delay, context.isReloadWithCookie);

You are awaiting each fetch before you continue with the next item. That is why this is asynchronous and not parallel programming. From the Microsoft docs on async:

The await keyword is where the magic happens. It yields control to the caller of the method that performed await, and it ultimately allows a UI to be responsive or a service to be elastic.

In essence, it frees the calling thread to do other work, but the original calling code is suspended until the I/O operation is done.

Now to your problem:

  1. Use the excellent solution here: foreach async
  2. Use the Parallel library to execute your code on different threads.

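Something like the following, using Parallel.For, for example:

// Materialize the sequence so it can be indexed (requires using System.Linq).
var urlList = urls.ToList();
Parallel.For(0, urlList.Count,
    index => fetchBlcock.Post(urlList[index]));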
Athanasios Kataras
  • Athanasios, `Post`ing to a dataflow block is practically instantaneous. There is no need to parallelize it with `Parallel.For`. – Theodor Zoulias Oct 31 '20 at 08:33
  • I based my answer mostly on this: https://learn.microsoft.com/en-us/dotnet/api/system.threading.tasks.dataflow.dataflowblock.post?view=netcore-3.1#System_Threading_Tasks_Dataflow_DataflowBlock_Post__1_System_Threading_Tasks_Dataflow_ITargetBlock___0____0_ `For target blocks that support postponing offered messages, or for blocks that may do more processing in their Post implementation, consider using SendAsync, which will return immediately`, inferring that it is not instantaneous. I'm not really an expert on Dataflow, though, so it's just a suggestion. – Athanasios Kataras Oct 31 '20 at 08:42
  • The `Post` method returns immediately, and the return value indicates whether the message was accepted or not. The `SendAsync` method also returns immediately, and the return value is a `Task` that completes when the dataflow block has definitely accepted or rejected the message. The difference between the two shows up in case of postponement. `Post` interprets a `Postponed` response as not accepted, while `SendAsync` interprets it as a promise that may be fulfilled later and returns an incomplete task. In practice you need to use `SendAsync` with blocks configured with `BoundedCapacity`. – Theodor Zoulias Oct 31 '20 at 09:07
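
The last comment can be sketched as follows, assuming a block configured with `BoundedCapacity` that is fed with `SendAsync`. `BoundedFeedSketch`, `RunAsync`, the integer inputs, and the `Task.Delay` standing in for a download are all hypothetical and are not taken from the question's code.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Hypothetical demo: feeds a bounded TransformBlock with SendAsync.
public static class BoundedFeedSketch
{
    public static async Task<List<int>> RunAsync(IEnumerable<int> inputs)
    {
        var block = new TransformBlock<int, int>(
            async n =>
            {
                await Task.Delay(5);   // stand-in for an asynchronous download
                return n * 2;
            },
            new ExecutionDataflowBlockOptions
            {
                MaxDegreeOfParallelism = 4,
                BoundedCapacity = 8    // a full block postpones further messages
            });

        // Drain concurrently: the capacity also covers results waiting in the
        // output buffer, so without a consumer the producer would stall.
        var results = new List<int>();
        var consumer = Task.Run(async () =>
        {
            while (await block.OutputAvailableAsync())
                while (block.TryReceive(out var item))
                    results.Add(item);
        });

        // SendAsync waits asynchronously while the block is full;
        // Post would return false in that situation and the item would be lost.
        foreach (var n in inputs)
            await block.SendAsync(n);

        block.Complete();
        await consumer;   // finishes once every result has been received
        return results;
    }
}
```

With an unbounded block, as in the question, `Post` is sufficient; the bounded variant mainly helps when the list of URLs is too large to queue all at once.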