public string SavePath { get; set; } = @"I:\files\";

public void DownloadList(List<string> list)
{
    var result = Parallel.ForEach(rest, link=>
    {
        Download(link);
    });
}

private void Download(string link)
{
    using(var net = new System.Net.WebClient())
    {
        var data = net.DownloadData(link);

        var fileName = GenerateUniqueFileName(link); // stands in for the original's "code to generate unique fileName"
        if (File.Exists(fileName))
            return;

        File.WriteAllBytes(fileName, data);
    }
}

var downloader = new DownloaderService();
var links = downloader.GetLinks();
downloader.DownloadList(links);

I observed that the project's RAM usage keeps growing.

I guess there is something wrong with the Parallel.ForEach(), but I cannot figure it out.

Is this a memory leak, or what is happening?


Update 1

After changing to the new code:

private void Download(string link)
{
    using(var net = new System.Net.WebClient())
    {
        var fileName = GenerateUniqueFileName(link); // stands in for the original's "code to generate unique fileName"
        if (File.Exists(fileName))
            return;
        net.DownloadFile(link, fileName); // DownloadFile returns void, so there is no data variable
        Track theTrack = new Track(fileName);
        theTrack.Title = GetCDName();
        theTrack.Save();
    }
}


After letting it run for 9 hours I still observed increasing memory use, although it grows much more slowly now.

Just wondering: is it because I didn't free the memory used by theTrack?

Btw, I use the ATL package to update file metadata; unfortunately, it doesn't implement the IDisposable interface.

Theodor Zoulias
Franva
  • It's not necessarily a memory leak; it may be that the garbage collector hasn't run. – Sean Mar 30 '20 at 10:58
  • `File.WriteAllBytes(fileName, data);` What size of `data` do you have? Have you tried writing it in chunks? – Pavel Anikhouski Mar 30 '20 at 10:59
  • Since you're processing the items in parallel, you hold the content of each downloaded file in memory, multiple files at the same time. That's why it increases, especially if they are large files. As @Sean mentioned, the GC most likely hasn't cleaned up yet. – devsmn Mar 30 '20 at 11:05
  • hi @PavelAnikhouski most of the time the files will be less than 1MB, but they can also be more than 100MB; the biggest so far is 155MB. How do I write in chunks? – Franva Mar 30 '20 at 11:09
  • Why are you not using `WebClient.DownloadFile()` to directly download to a file? – riQQ Mar 30 '20 at 11:11
  • @Shawn I have let the program run for more than 1 hour, wouldn't the GC run at least once? How do I release the content of each downloaded file after it is saved? – Franva Mar 30 '20 at 11:11
  • @riQQ DownloadFile() fixed the issue. Thanks! – Franva Mar 30 '20 at 11:16
  • hi @Shawn you are correct, after I removed File.WriteAllBytes() and used DownloadFile() directly, my memory leak was gone. Thanks – Franva Mar 30 '20 at 11:17

2 Answers


The Parallel.ForEach method is intended for parallelizing CPU-bound workloads. Downloading a file is an I/O-bound workload, so Parallel.ForEach is not ideal for this case because it needlessly blocks ThreadPool threads. The correct way to do it is asynchronously, with async/await. The recommended class for making asynchronous web requests is the HttpClient, and for controlling the level of concurrency an excellent option is the TPL Dataflow library. For this case it is enough to use the simplest component of this library, the ActionBlock class:

async Task DownloadListAsync(List<string> list)
{
    using (var httpClient = new HttpClient())
    {
        var rest = ExcludeDownloaded(list);
        var block = new ActionBlock<string>(async link =>
        {
            await DownloadFileAsync(httpClient, link);
        }, new ExecutionDataflowBlockOptions()
        {
            MaxDegreeOfParallelism = 10
        });
        foreach (var link in rest)
        {
            await block.SendAsync(link);
        }
        block.Complete();
        await block.Completion;
    }
}

async Task DownloadFileAsync(HttpClient httpClient, string link)
{
    var fileName = Guid.NewGuid().ToString(); // code to generate unique fileName;
    var filePath = Path.Combine(SavePath, fileName);
    if (File.Exists(filePath)) return;
    var response = await httpClient.GetAsync(link);
    response.EnsureSuccessStatusCode();
    using (var contentStream = await response.Content.ReadAsStreamAsync())
    using (var fileStream = new FileStream(filePath, FileMode.Create,
        FileAccess.Write, FileShare.None, 32768, FileOptions.Asynchronous))
    {
        await contentStream.CopyToAsync(fileStream);
    }
}

The code for downloading a file with HttpClient is not as simple as the WebClient.DownloadFile(), but it's what you have to do in order to keep the whole process asynchronous (both reading from the web and writing to the disk).


Caveat: Asynchronous filesystem operations are currently not implemented efficiently in .NET. For maximum efficiency it may be preferable to avoid using the FileOptions.Asynchronous option in the FileStream constructor.
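To make the caveat concrete, here is a minimal, self-contained sketch of the same copy loop with the FileStream opened on a synchronous handle (useAsync: false, i.e. no FileOptions.Asynchronous). A MemoryStream stands in for the HTTP content stream so the snippet runs without a network connection; the file path is a hypothetical temp file.

```csharp
using System;
using System.IO;
using System.Text;
using System.Threading.Tasks;

// Same asynchronous copy pattern as above, but the target FileStream is
// opened with useAsync: false, avoiding the overhead of .NET's async
// file handles. The source MemoryStream stands in for the HTTP stream.
var filePath = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N") + ".bin");
using (var contentStream = new MemoryStream(Encoding.UTF8.GetBytes("payload")))
using (var fileStream = new FileStream(filePath, FileMode.Create,
    FileAccess.Write, FileShare.None, bufferSize: 32768, useAsync: false))
{
    await contentStream.CopyToAsync(fileStream);
}
Console.WriteLine(File.ReadAllText(filePath)); // payload
```

CopyToAsync still awaits, so the calling code stays asynchronous; only the OS-level disk writes become synchronous.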


.NET 6 update: The preferable way for parallelizing asynchronous work is now the Parallel.ForEachAsync API. A usage example can be found here.
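Since the linked example may not be available, here is a minimal runnable sketch of the Parallel.ForEachAsync shape. The links and the Task.Delay body are placeholders; in the question's scenario the body would call something like DownloadFileAsync.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical links; in the question these would come from GetLinks().
var links = new List<string> { "a", "b", "c", "d", "e" };
var completed = new ConcurrentBag<string>();

// Parallel.ForEachAsync (.NET 6+) awaits an asynchronous body per item
// and throttles concurrency via MaxDegreeOfParallelism, without blocking
// ThreadPool threads the way Parallel.ForEach does.
await Parallel.ForEachAsync(
    links,
    new ParallelOptions { MaxDegreeOfParallelism = 2 },
    async (link, cancellationToken) =>
    {
        await Task.Delay(10, cancellationToken); // stand-in for DownloadFileAsync
        completed.Add(link);
    });

Console.WriteLine(completed.Count); // 5
```

Unlike the ActionBlock approach, no explicit Complete/Completion handshake is needed; the awaited call returns when every item has been processed.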

Theodor Zoulias
  • There's also `WebClient.DownloadFileAsync` – riQQ Mar 30 '20 at 13:02
  • @riQQ indeed. But take a look at the [remarks](https://learn.microsoft.com/en-us/dotnet/api/system.net.webclient#remarks): *Important - We don't recommend that you use the WebClient class for new development. Instead, use the System.Net.Http.HttpClient class.* – Theodor Zoulias Mar 30 '20 at 13:05
  • Interesting, but sadly there's no reason given why they don't recommend the use for new development. – riQQ Mar 30 '20 at 13:07
  • @riQQ the `HttpClient` is supposed to offer better performance because it reuses connections: you don't have to create a new instance for every request, as you do with the `WebClient`. – Theodor Zoulias Mar 30 '20 at 13:18
  • beautiful solution with rich explanation, in answer and in comments~! Great example~! thanks! I will give it a try during lunch break (I started working already >_<;) – Franva Mar 30 '20 at 21:08
  • hi @TheodorZoulias I have tried the code, but I have not found any increase in performance. It is even slower than Parallel.ForEach(). – Franva Apr 09 '20 at 10:15
  • @Franva have you experimented with various values for `MaxDegreeOfParallelism`? – Theodor Zoulias Apr 09 '20 at 10:18
  • Nope, I only tried with the value 50, as I use 50 for Parallel.ForEach() as well; I want to make sure they are utilizing the same number of threads. – Franva Apr 09 '20 at 12:05
  • @Franva if you want to make a fair comparison you should add this line before the `Parallel.ForEach` method: `ThreadPool.SetMinThreads(50, 10);`, so that the `ThreadPool` has 50 threads available to execute 50 concurrent web requests. Without this line you are introducing an implicit throttling caused by the starved `ThreadPool`. The outcome of your test indicates that 50 concurrent operations are too many for the remote server to handle, or for the web connection, or for the storage hardware. I suggest that you experiment with a smaller concurrency level. 50 seems to be far from optimal. – Theodor Zoulias Apr 09 '20 at 12:41
  • thanks @TheodorZoulias I will try it and update you. Appreciate your help and explanation. I have learnt a lot about multithreaded programming. – Franva Apr 10 '20 at 10:35

Use WebClient.DownloadFile() to download directly to a file so you don't have the whole file in memory.
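A minimal sketch of the asker's Download method rewritten this way; GenerateUniqueFileName is a hypothetical stand-in for the asker's unwritten naming logic, and the demo uses a file:// URL so it runs without a network connection.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

// DownloadFile streams straight to disk, so the file body never sits in
// a byte[] on the managed heap. The existence check also moves before
// the download, so already-saved files are skipped without any I/O.
void Download(string link, string savePath)
{
    var fileName = Path.Combine(savePath, GenerateUniqueFileName(link));
    if (File.Exists(fileName))
        return;

    using (var net = new WebClient())
    {
        net.DownloadFile(link, fileName); // returns void; nothing buffered in memory
    }
}

// Hypothetical unique-name scheme: a filesystem-safe encoding of the link.
string GenerateUniqueFileName(string link) =>
    Convert.ToBase64String(Encoding.UTF8.GetBytes(link)).Replace('/', '_') + ".bin";

// Demo using a local file as the "remote" resource.
var source = Path.GetTempFileName();
File.WriteAllText(source, "hello");
var saveDir = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N"));
Directory.CreateDirectory(saveDir);
Download(new Uri(source).AbsoluteUri, saveDir);
Console.WriteLine(File.ReadAllText(Directory.GetFiles(saveDir)[0])); // hello
```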

riQQ
  • Just out of curiosity, if I used File.WriteAllBytes(), how could I free its memory after each use? – Franva Mar 30 '20 at 11:19
  • @Franva Basically, you're allocating large byte arrays. They are handled specifically by .NET. You can read up more here: https://learn.microsoft.com/en-us/dotnet/standard/garbage-collection/large-object-heap – riQQ Mar 30 '20 at 11:28
  • `GC.Collect()` frees the memory, if you don't hold any references to the object / byte array anymore. – riQQ Mar 30 '20 at 11:33