
I am trying to perform some transformations on some CSV files in Azure Data Lake Storage Gen2. As a first step, I am downloading the files from the data lake using a DataLakeDirectoryClient object from the Azure.Storage.Files.DataLake NuGet package.

I have a list of the file names which I'm looping over, creating a DataLakeFileClient object for each.

There are 300 files in the folder. Each is roughly 190 KB in size.

The problem is that my code stops running after downloading 50 files. No exception is thrown, so I can't identify the issue. I'm testing this in a console application, and the console remains open until I manually stop the program.

directoryClient is a DataLakeDirectoryClient object.

fileNames is a list of strings e.g. "file1.csv", "file2.csv", ..., "file300.csv".

var fileStreams = new List<Stream>();
foreach (var fileName in fileNames)
{
    var fileClient = directoryClient.GetFileClient(fileName);
    var fileDownloadResponse = await fileClient.ReadAsync(); // Code hangs here on 51st file
    fileStreams.Add(fileDownloadResponse.Value.Content);

    Console.WriteLine($"{fileName} downloaded.");
}

On the console, I see that the first 50 files are downloaded. For the 51st file, the fileClient is created successfully, but the ReadAsync method never returns a response.

I have separately tried downloading just the 51st file, and that works with no issues. So it seems this has nothing to do with that file in particular.
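(For anyone hitting the same symptom: a plausible explanation, though I haven't verified it against the SDK source, is HTTP connection-pool exhaustion. The Azure SDK's default transport limits concurrent connections per endpoint, and each response's Content stream holds its connection open until the stream is disposed, so hoarding 50 undisposed streams can leave no free connection for the 51st request. A sketch of a version that buffers each file into memory and disposes the network stream, so the connection is returned to the pool; `fileContents` is a hypothetical replacement for the original list:)

```csharp
var fileContents = new List<MemoryStream>();
foreach (var fileName in fileNames)
{
    var fileClient = directoryClient.GetFileClient(fileName);
    var fileDownloadResponse = await fileClient.ReadAsync();

    // Copy the network stream into memory, then dispose it so the
    // underlying HTTP connection goes back to the pool.
    var buffer = new MemoryStream();
    using (var networkStream = fileDownloadResponse.Value.Content)
    {
        await networkStream.CopyToAsync(buffer);
    }
    buffer.Position = 0; // rewind so the transformation step can read from the start
    fileContents.Add(buffer);

    Console.WriteLine($"{fileName} downloaded.");
}
```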

Since I could not find an explanation for this, I refactored my code. Rather than trying to download all 300 files up front, I now download one file at a time, perform the transformations I need, and then move on to the next file. This is working for me, and all 300 files have been transformed successfully.

However, I still wanted to post this question because I would like to know exactly what went wrong with this original attempt, just to improve my own understanding of C#.

James
    It might well be that you get throttled. It might not like that you have 50 pending streams to be read. If transforming doesn't take ages maybe you can execute some of it in parallel: https://stackoverflow.com/a/19103047 and https://stackoverflow.com/a/35686494 among many more. Note that it matters whether your workload is IO bound or CPU bound for the correct / preferred approach. – rene May 28 '23 at 10:17
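    (A sketch of the throttled-parallel approach those links describe, applied to this loop; `TransformAsync` is a hypothetical stand-in for the per-file transformation, and the concurrency limit of 8 is arbitrary:)

    ```csharp
    // Cap concurrent downloads so at most 8 connections/streams are open at once.
    using var throttle = new SemaphoreSlim(8);

    var tasks = fileNames.Select(async fileName =>
    {
        await throttle.WaitAsync();
        try
        {
            var fileClient = directoryClient.GetFileClient(fileName);
            var fileDownloadResponse = await fileClient.ReadAsync();

            // Dispose the stream as soon as the file is processed.
            using var content = fileDownloadResponse.Value.Content;
            await TransformAsync(fileName, content); // hypothetical transformation step
        }
        finally
        {
            throttle.Release();
        }
    });

    await Task.WhenAll(tasks);
    ```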

0 Answers