
I have a console application that is going to take about 625 days to complete, unless there is a way to make it faster.

First off, I am working in a directory that has around 4,000,000 files in it, if not more. I'm also working with a database that has a row for each file, and then some.

Working with the SQL is relatively fast; the bottleneck is File.Move(), where each move takes 18 seconds to complete.

Is there a faster way than File.Move()?

This is the bottleneck:

File.Move(Path.Combine(location, fileName), Path.Combine(rootDir, fileYear, fileMonth, fileName));

All of the other code runs pretty fast. All I need to do is move one file to a new location and then update the database location field.

I can show other code if needed, but really the above is the only current bottleneck.
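Roughly, the per-file work looks like the sketch below. GetRowsToMove and UpdateLocation are just placeholders for my actual database query and update, and the property names are made up for illustration.

foreach (var record in GetRowsToMove())            // placeholder for reading the rows from the database
{
    string fileName = record.FileName;             // illustrative property names
    string fileYear = record.Year;
    string fileMonth = record.Month;
    string destination = Path.Combine(rootDir, fileYear, fileMonth, fileName);

    // this is the 18-second bottleneck
    File.Move(Path.Combine(location, fileName), destination);

    // then update the Location column so the application that uses it can find the document
    UpdateLocation(record, destination);           // placeholder for the SQL update
}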

James Wilson
  • If you're using a database anyway, why do you need 4,000,000 files at all? – Tim Schmelter Sep 23 '13 at 21:05
  • @TimSchmelter It's originally how they designed it. The database houses some information from the file; the only part I need to update is the Location column. That column tells the application they use where the document is located so it can be opened. – James Wilson Sep 23 '13 at 21:06
  • If each move takes 18 seconds then something else is *very* wrong, and it's probably not your use of the API. – cdhowie Sep 23 '13 at 21:09
  • Possibly of interest? [Asynchronous File Copy/Move in C#](http://stackoverflow.com/q/882686/427192) – Dan Pichelman Sep 23 '13 at 21:09
  • How big are the files? How long does it take to move one by hand? Is this being moved across a network? – Dan Pichelman Sep 23 '13 at 21:10
  • @cdhowie What would/could that be? It's a single directory with 4+ million files in it that isn't indexed. – James Wilson Sep 23 '13 at 21:10
  • 1
    @JamesWilson Then it's probably taking the operating system that long to update the containing directory. – cdhowie Sep 23 '13 at 21:12
  • @DanPichelman The majority of them are 100 KB or less; there are quite a few that are 1-2 MB. The program runs on my machine and goes out to a network share to move the files on that share into a more organized structure. – James Wilson Sep 23 '13 at 21:12
  • Is there any chance that your code can be run on the server that has the files locally? Right now you're probably pulling all that data over the network to your local machine, then back over the network again to write it out. – Dan Pichelman Sep 23 '13 at 21:16
  • @DanPichelman I can check on that, but I would have to put VS on the server, which might be possible. Would a look at my code help in any way, or is it pretty likely it's the 4+ million files that is the bottleneck with no real way to improve it? – James Wilson Sep 23 '13 at 21:19
  • You won't need VS, just the .NET distribution DLLs (which are probably already there). If you have access to a server and/or network expert, talk to them about performance monitoring your machine. Ideally you're pegging the I/O on your box. – Dan Pichelman Sep 23 '13 at 21:24
  • Running the code on the box that has the files locally could make a very significant difference - kind of like moving water through a fire hose vs through a soda straw. – Dan Pichelman Sep 23 '13 at 21:25
  • @DanPichelman alright I will talk with him, thank you for the advice. That does make sense on how it would be faster. – James Wilson Sep 23 '13 at 21:36
  • @DanPichelman Looks like I may be running out of options. It is a NAS device and he said it wouldn't be possible to run it locally. – James Wilson Sep 23 '13 at 21:38

3 Answers


It turns out switching from File.Move to setting up a FileInfo and using .MoveTo increased the speed significantly.

It will run in about 35 days now as opposed to 625 days.

// Wrap the source path in a FileInfo once, then move it with MoveTo
FileInfo fileInfo = new FileInfo(Path.Combine(location, fileName));
fileInfo.MoveTo(Path.Combine(rootDir, fileYear, fileMonth, fileName));
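If you want to verify the difference on your own share before a full run, a rough timing sketch like this should show the gap; sampleA and sampleB are assumed to be two small lists of file names still sitting in the big directory, not anything from my real code.

// Hypothetical timing comparison between File.Move and FileInfo.MoveTo
var sw = System.Diagnostics.Stopwatch.StartNew();
foreach (var name in sampleA)
    File.Move(Path.Combine(location, name),
              Path.Combine(rootDir, fileYear, fileMonth, name));
sw.Stop();
Console.WriteLine("File.Move:       {0} ms", sw.ElapsedMilliseconds);

sw.Restart();
foreach (var name in sampleB)
    new FileInfo(Path.Combine(location, name))
        .MoveTo(Path.Combine(rootDir, fileYear, fileMonth, name));
sw.Stop();
Console.WriteLine("FileInfo.MoveTo: {0} ms", sw.ElapsedMilliseconds);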
James Wilson
  • This is good info. Seems odd it would be that way, though. I might have to research why this is so. – Jim Mischel Sep 24 '13 at 21:13
  • @JimMischel yeah I've been testing this all day, the speed has been a consistent change with this many files. All I could find is that File.Move checks for permission/security on each call, where fileInfo.MoveTo() only checks it a single time. If you find anything else out I'd love to know. – James Wilson Sep 24 '13 at 21:32
  • Very strange. I didn't find any speed improvement: 10529 ms (32824028 ticks) Directory.Move, 13358 ms (41642456 ticks) new FileInfo().Move, 10926 ms (34061807 ticks) File.Move(). That is for 16385 files. – Timur Lemeshko Jul 10 '17 at 13:48

18 seconds isn't really unusual. NTFS does not perform well when you have a lot of files in a single directory. When you ask for a file, it has to do a linear search of its directory data structure. With 1,000 files, that doesn't take too long. With 10,000 files you notice it. With 4 million files . . . yeah, it takes a while.

You can probably do this even faster if you pre-load all of the directory entries into memory. Then rather than calling the FileInfo constructor for each file, you just look it up in your dictionary.

Something like:

var dirInfo = new DirectoryInfo(path);
// read every entry in the directory once, up front
var files = dirInfo.GetFileSystemInfos();
// cache the entries by full path so each lookup is a dictionary hit instead of a disk search
var cache = new Dictionary<string, FileSystemInfo>();
foreach (var f in files)
{
    cache.Add(f.FullName, f);
}

Now when you get a name from the database, you can just look it up in the dictionary. That might very well be faster than trying to get it from the disk each time.
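For example, the move itself could then look something like this; fileNamesFromDatabase is just a stand-in for however you read the names back from your database.

// Hypothetical usage of the cache built above (keyed by FullName)
foreach (var fileName in fileNamesFromDatabase)
{
    FileSystemInfo info;
    if (cache.TryGetValue(Path.Combine(dirInfo.FullName, fileName), out info))
    {
        var file = info as FileInfo;  // GetFileSystemInfos can also return directories
        if (file != null)
        {
            file.MoveTo(Path.Combine(rootDir, fileYear, fileMonth, fileName));
        }
    }
}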

Jim Mischel
  • I'm afraid to test this as it would need to load 4 million files into the dictionary before it could begin any work on moving them. And then once they are in the dictionary I would still need to perform a File.Move or fileinfo.MoveTo() on the file, if I'm not mistaken? – James Wilson Sep 24 '13 at 20:22
  • @JamesWilson: Yes, you would still need to do the `fileinfo.MoveTo()`. The idea is that pre-loading all of the entries would eliminate you having to search for them one-by-one. Whether 4 million entries is a memory problem, I don't know. I also don't know how long it'd take to load, although I suspect it'd be much less than an hour. Whether the result would be faster than your 35 days, I don't know for sure. – Jim Mischel Sep 24 '13 at 21:11

You can move the files in parallel, and Directory.EnumerateFiles gives you a lazily loaded list of files (of course, I have not tested it with 4,000,000 files):

var numberOfConcurrentMoves = 2;
var moves = new List<Task>();
var sourceDirectory = "source-directory";
var destinationDirectory = "destination-directory";

foreach (var filePath in Directory.EnumerateFiles(sourceDirectory))
{
    var move = new Task(() =>
    {
        File.Move(filePath, Path.Combine(destinationDirectory, Path.GetFileName(filePath)));

        //UPDATE DB
    }, TaskCreationOptions.PreferFairness);
    move.Start();

    moves.Add(move);

    if (moves.Count >= numberOfConcurrentMoves)
    {
        Task.WaitAll(moves.ToArray());
        moves.Clear();
    }
}

Task.WaitAll(moves.ToArray());
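If you would rather not manage the batching yourself, Parallel.ForEach with a capped MaxDegreeOfParallelism is a rough equivalent; this is only a sketch under the same assumptions (placeholder directory names, database update omitted) and is equally untested at this scale.

// Rough alternative: let the TPL cap the concurrency instead of batching manually
Parallel.ForEach(
    Directory.EnumerateFiles(sourceDirectory),
    new ParallelOptions { MaxDegreeOfParallelism = numberOfConcurrentMoves },
    filePath =>
    {
        File.Move(filePath, Path.Combine(destinationDirectory, Path.GetFileName(filePath)));

        //UPDATE DB
    });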
Kaveh Shahbazian