
I'm trying to download a huge number of files (pictures) from the internet. I'm struggling with async/parallel, because

a) I can't tell whether there is a file or not. I just have a million links, each pointing either to a single picture (300 kB to 3 MB) or to a 404 "page does not exist". To avoid downloading a 0-byte file, I request the same page twice: once to check for the 404 and then again for the picture. The other way would be to download all the 0-byte files and delete millions of them afterwards, which keeps Windows 10 stuck on that task until I reboot.

b) While the (very slow) download is in progress, whenever I look at one of the "successfully downloaded" files, it has been created with 0 bytes and doesn't contain the picture. What do I need to change so that each file is really downloaded before the next one is started?

How do I fix both issues? Is there a better way to download thousands or millions of files (compression/creating a .zip on the server is not possible)?
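To clarify what I mean in a): ideally there would be a single cheap check per link before anything is created on disk. The following sketch is what I have in mind, assuming the server answers HEAD requests; this is not my working code:

// sketch only - needs System.Net.Http and System.Threading.Tasks
private static readonly HttpClient probeClient = new HttpClient();

// true if the link points to a real picture, false on 404 - without downloading the body
private static async Task<bool> LinkExistsAsync(string url)
{
    using (var request = new HttpRequestMessage(HttpMethod.Head, url))
    using (HttpResponseMessage response = await probeClient.SendAsync(request))
    {
        return response.IsSuccessStatusCode;
    }
}

My current attempt (which has both problems) is below: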

//loopResult = Parallel.ForEach(_downloadLinkList, new ParallelOptions { MaxDegreeOfParallelism = 10 }, DownloadFilesParallel);
private async void DownloadFilesParallel(string path)
{
    string downloadToDirectory = "";
    string x = ""; // if this request fails, I get a 404 from the web server and therefore no download is needed
    System.Threading.Interlocked.Increment(ref downloadCount);
    OnNewListEntry(downloadCount.ToString() + " / " + linkCount.ToString() + " heruntergeladen"); // tell my GUI to update
    try
    {
        using(WebClient webClient = new WebClient())
        {
            downloadToDirectory = Path.Combine(savePathLocalComputer, Path.GetFileName(path)); // path on the local computer

            webClient.Credentials = CredentialCache.DefaultNetworkCredentials;
            x = await webClient.DownloadStringTaskAsync(new Uri(path)); // if this throws an exception, ignore this link
            Directory.CreateDirectory(Path.GetDirectoryName(downloadToDirectory)); // if the request is successful, create the folder on the local pc if needed
            await webClient.DownloadFileTaskAsync(new Uri(path), @downloadToDirectory); // should download the file, then release 1 parallel task to get the next file. Instead there is a 0-byte file and the next one is downloaded
        }
    }
    catch(WebException wex)
    {
    }
    catch(Exception ex)
    {
        System.Diagnostics.Debug.WriteLine(ex.Message);
    }
    finally
    {

    }
}

*(screenshot of the download GUI omitted; picture is sfw, link is nsfw)*

Hans Jo
  • *Is there any better way* - yes. Can you define better in a way that lets us work on one problem at once, or are you looking for us to write your download manager for you? Which reminds me - there are boatloads of download managers out there that you could hand your million links to and press go; with so many available dogs, why are you barking yourself? – Caius Jard Jul 14 '20 at 20:03
  • You're not explicitly checking the WebException for a 404 status code. If the download request starts but then times out, the corresponding file will be empty and your application will carry on as if it completed successfully. – Andrew Williamson Jul 14 '20 at 20:13
  • @CaiusJard no, I don't want you to write a download manager for me. Since this is part of a bigger solution/project (I'm using it with porn pictures because the real server isn't available at the moment and this page is close to the real one. As soon as it is, I need to download, store, manipulate... a million hand drawings of machines), using a download manager is no option for me. @AndrewWilliamson so how do I handle it correctly? I let it run for 30 min now; after the last file finished, all the downloaded 0-byte pictures were populated at once with the corresponding pictures – Hans Jo Jul 14 '20 at 20:19
  • ps; there's nothing parallel about that code block - where is the parallelism? And there is no need for this "downloading twice to avoid a 0-byte file" approach. Really confused as to why an already-rolled solution is no good, because the screenshot looks like you're writing a WinForms-based download manager - surely something that exists already. Heck, even taking your million links and splitting them into 10 files of 100k links each, then using find/repl to put "wget" or "curl" at the start of every line would be quicker and easier.. – Caius Jard Jul 14 '20 at 20:19
  • 1
    Try to avoid `Parallel.ForEach` and `async` (not *impossible*, but tricky). See these: [Throttling asynchronous tasks](https://stackoverflow.com/q/22492383/7444103) (see `noseratio` answer), [Queue of async tasks with throttling which supports muti-threading](https://stackoverflow.com/q/34315589/7444103). On the other hand, you want to catch an exception before the download starts, so Webclient may not be the right choice. HttpClient works better in both departments. – Jimi Jul 14 '20 at 20:22
  • @CaiusJard so how do I avoid that? If I use await webClient.DownloadFileTaskAsync() and it is a 404 page, it runs into the catch, but creates a 0-byte file. I've got about 2 million 0-byte files alongside ~6,000 valid pictures from the last 2 nights. – Hans Jo Jul 14 '20 at 20:23
  • 2
    You delete them as you go, or you don't increment the file numbering/naming strategy upon a 404 but only upon successful download, so that the 0 byte file caused by this 404 is replaced by a real file next time that doesn't 404. It's gonna be much faster to ask your local filesystem the length of file X than it is to ask a remote web server to give you a 404 and skip a download, but as noted you can use the absence of a 404 to roll your file number on by one simply by putting th increment *after* the DownloadFile rather than before it – Caius Jard Jul 14 '20 at 20:31
  • @Jimi I just want to avoid creating empty files whenever the url points to a 404 page. If there is a better/faster/easier way (creating millions of empty files and deleting them afterwards is a bad idea on an indexed Windows server), I'd like to go that way. My first try was to use multiple threads, but I always ran into deadlocks... I'll have a look at noseratio's answer; looks interesting – Hans Jo Jul 14 '20 at 20:31
  • @CaiusJard the naming is given by the website, like "google.de/folder1/file1" -> D:Downloads/Folder1/file1.jpg. Yes, delete as I go is a -bad- possibility because it will run on an indexed Windows server. Each time a file is created/deleted, the index will refresh --> millions upon millions of times. Say I get a million links where only 10k are valid. I just want to create the 10k folders/files – Hans Jo Jul 14 '20 at 20:35
  • You're claiming that every time you add a 0 byte file to a disk on which Windows Search indexer is running, it will rebuild the entire index? Can you cite a source for this behavior? – Caius Jard Jul 15 '20 at 05:05
  • @CaiusJard sure I can. Apply to the company I'm working for, write an email to the Chinese/Romanian network & server admins and ask them why they set up the configuration so that every new/deleted file needs to refresh the entire index. I don't know what pros this has... I just got a C# project to improve/fix because the colleague is on holiday for six weeks and the project needs to go on ¯\_(ツ)_/¯ – Hans Jo Jul 15 '20 at 07:28
  • I've updated the answer with `IProgress` feature. – aepot Jul 15 '20 at 10:20
  • Sounds like a configuration that one wouldn't want to enable on a server, that's for sure! – Caius Jard Jul 15 '20 at 14:18

1 Answer


Here's an example using HttpClient with a limit on the maximum number of concurrent downloads.

private static readonly HttpClient client = new HttpClient();

private async Task DownloadAndSaveFileAsync(string path, SemaphoreSlim semaphore, IProgress<int> status)
{
    try
    {
        status?.Report(semaphore.CurrentCount);
        using (HttpResponseMessage response = await client.GetAsync(path, HttpCompletionOption.ResponseHeadersRead).ConfigureAwait(false))
        {
            if (response.IsSuccessStatusCode) // ignoring if not success
            {
                string filePath = Path.Combine(savePathLocalComputer, Path.GetFileName(path));
                string dir = Path.GetDirectoryName(filePath);
                if (!Directory.Exists(dir)) Directory.CreateDirectory(dir);
                using (Stream responseStream = await response.Content.ReadAsStreamAsync().ConfigureAwait(false))
                using (FileStream fileStream = File.Create(filePath))
                {
                    await responseStream.CopyToAsync(fileStream).ConfigureAwait(false);
                }
            }
        }
    }
    finally
    {
        semaphore.Release();
    }
}
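Note that HttpCompletionOption.ResponseHeadersRead makes GetAsync return as soon as the status line and headers have arrived, so a 404 is detected without anything being created on disk, and the body is streamed straight into the FileStream only when the status indicates success. That addresses both a) and b) with a single request per link.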

The concurrency part:

client.BaseAddress = new Uri("http://somesite"); // BaseAddress is a Uri, not a string
int downloadCount = 0;
List<string> pathList = new List<string>();
// fill the list here

List<Task> tasks = new List<Task>();
int maxConcurrentTasks = Environment.ProcessorCount * 2; // 16 for me

IProgress<int> status = new Progress<int>(availableTasks =>
{
    downloadCount++;
    OnNewListEntry(downloadCount + " / " + pathList.Count + " heruntergeladen\r\nRunning " + (maxConcurrentTasks - availableTasks) + " downloads.");
});

using (SemaphoreSlim semaphore = new SemaphoreSlim(maxConcurrentTasks))
{
    foreach (string path in pathList)
    {
        await semaphore.WaitAsync();
        tasks.Add(DownloadAndSaveFileAsync(path, semaphore, status));
    }
    try
    {
        await Task.WhenAll(tasks);
    }
    catch (Exception ex)
    {
        // handle the Exception here
    }
}
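Because semaphore.WaitAsync() is awaited before each task is started, at most maxConcurrentTasks downloads are in flight at any moment; every finished (or failed) download releases one slot in its finally block and the loop then starts the next link.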

Progress here simply executes the callback on the UI thread (it captures the SynchronizationContext it was created on), so Interlocked is not needed inside and it's safe to update the UI.

On .NET Framework, to make it faster you may add this line to the app startup code (on .NET Core it has no effect and isn't needed):

ServicePointManager.DefaultConnectionLimit = 10;
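
For example, in a WinForms app it could go at the very start of Main, before the first request is made (a sketch; Form1 is a placeholder for your actual form):

[STAThread]
static void Main()
{
    // raise the per-host HTTP connection limit before any request is issued
    ServicePointManager.DefaultConnectionLimit = 10;
    Application.EnableVisualStyles();
    Application.SetCompatibleTextRenderingDefault(false);
    Application.Run(new Form1());
}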
aepot