
I have a folder full of 30 thousand PDF files (please don't ask why).

I need to loop through them and match each file's date against the date value chosen in the Windows Forms date picker control.

Here is what I have:

public List<FileInfo> myList = new List<FileInfo>();
DirectoryInfo di = new DirectoryInfo(@"\\PDFs");

myList = di.EnumerateFiles("*.pdf")
           .Where(x => x.LastWriteTime.Date == datetime.Date)
           .ToList();

After I have the files in the list, I move them to another location for various other processing, but the part I definitely want to speed up is this enumeration step.

It's rather slow; is there any way to speed this up?

Thanks.

  • Have you tried using PowerShell for this? Or is it not an option? – Nils Jan 18 '17 at 19:58
  • Why do you have 30,000 PDF files in a folder? – dfundako Jan 18 '17 at 19:58
  • How are you sure that is where your slowness is? Did you run a profiler on the code and see that is where the time was being spent? Or do you do something with `myList` later, and that is where the real slowness is? – Scott Chamberlain Jan 18 '17 at 19:59
  • It's running over a network share. That might be the source of the slowness. –  Jan 18 '17 at 19:59
  • Read through and cache, then search off of cache? Do folder contents change often? – Jeremy Jan 18 '17 at 20:00
  • Do you need to go over the list twice? Doesn't myList = di.EnumerateFiles("*.pdf").Where(x => x.LastWriteTime.Date == DateTime.Now).ToList(); get you the same result as your foreach loop? – sous2817 Jan 18 '17 at 20:02
  • Buy a hard drive with faster read times. – Servy Jan 18 '17 at 20:03
  • @sous2817 That does not in fact iterate the data source multiple times. – Servy Jan 18 '17 at 20:03
  • @sous2817 He is not going over the list twice; EnumerateFiles is lazily evaluated, so the enumeration only happens once during the foreach. But you are correct that doing .ToList() would be equivalent. – Scott Chamberlain Jan 18 '17 at 20:03
  • If you want all files in one list, that's probably the quickest way. If you want to process more files at once, you could break the list into chunks (like map/reduce) and process each chunk independently on a different thread. – Peter Ritchie Jan 18 '17 at 20:10
  • Note that you'll get better answers if you answer some of the clarifying questions in the comments. – Heretic Monkey Jan 18 '17 at 20:15
  • It has to be the fact that you access the files over the network, like @Amy says. (I did a quick test locally comparing EnumerateFiles with GetFiles, and EnumerateFiles was the fastest by far.) You should probably load this file info up front in your app and filter from that. – Vidar Jan 18 '17 at 20:17
  • Scott - I put a breakpoint on the EnumerateFiles call and timed it; it took quite a while. Amy - yes, potentially. Sous2817 - thanks for that advice! Vidar - what do you mean by load this file info up front? – user3046756 Jan 18 '17 at 20:20
  • You can retrieve this list once, for instance when you render the form before the user starts interacting with it, and store it in a local variable on your class. Then each time the user picks a new date, you can filter from that list instead of going over the wire (see the sketch after these comments). If you want to get fancy you might even add some FileSystemWatcher logic to handle new files appearing on the remote drive (see http://stackoverflow.com/questions/11219373/filesystemwatcher-to-watch-unc-path) – Vidar Jan 18 '17 at 20:27
  • Thanks Vidar, I'l try that. – user3046756 Jan 18 '17 at 20:47
  • @user3046756 You can start retrieving that list before creating a form. Like `var itemsTask = Task.Run(() => GetItems()); var form = new Form(itemsTask)` and `Wait` for that task when some actual action is required - that can slightly improve user experience. And if you show those items in some control you may use concurrent and observable collection, that is filled by that task, to show at least some of the items + loading progress - it would look even better for the end user. – Eugene Podskal Jan 18 '17 at 20:53
  • Why did you change the code in your question? Doing so made many of the comments (e. g. those referring to a foreach loop) nonsensical! – Chris Dunaway Jan 18 '17 at 22:48
  • Possible duplicate of [Improve the performance for enumerating files and folders using .NET](https://stackoverflow.com/questions/17756042/improve-the-performance-for-enumerating-files-and-folders-using-net) – Damian Mar 18 '19 at 12:47
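
To make the caching approach from the comments concrete, here is a minimal sketch, assuming a Windows Forms form with a DateTimePicker - the PdfForm and datePicker names are hypothetical, and the share path is the one from the question:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;
using System.Windows.Forms;

public class PdfForm : Form
{
    private readonly DateTimePicker datePicker = new DateTimePicker();
    private readonly Task<List<FileInfo>> cacheTask;

    public PdfForm()
    {
        Controls.Add(datePicker);
        datePicker.ValueChanged += DatePicker_ValueChanged;

        // Enumerate the remote share once, off the UI thread,
        // before the user starts interacting with the form.
        cacheTask = Task.Run(() =>
            new DirectoryInfo(@"\\PDFs").EnumerateFiles("*.pdf").ToList());
    }

    private async void DatePicker_ValueChanged(object sender, EventArgs e)
    {
        var cache = await cacheTask;
        // Filtering the in-memory list avoids another round trip over the
        // wire; the FileInfo metadata was already read during enumeration.
        var matches = cache.Where(f => f.LastWriteTime.Date == datePicker.Value.Date)
                           .ToList();
        // ... move 'matches' to the other location for further processing ...
    }
}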

2 Answers


You don't have to wait for the whole list of files (myList) to be constructed - you can start processing as soon as the first file is enumerated. Just use Parallel.ForEach to copy and process each file. In the example below I'm using the ConcurrentBag collection to store the results.

var results = new ConcurrentBag<ProcessingResult>();

// EnumerateFiles is lazy, so Parallel.ForEach can start working
// as soon as the first matching file is returned.
var files = di.EnumerateFiles("*.pdf").Where(x => x.LastWriteTime.Date == datetime.Date);
Parallel.ForEach(files, file => {
    var newLocation = CopyToNewLocation(file);
    var processingResult = ExecuteAdditionalProcessing(newLocation);

    // ConcurrentBag is safe for concurrent adds from multiple threads.
    results.Add(processingResult);
});
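
If the work turns out to be IO-bound rather than CPU-bound (see the comment below), unbounded parallelism can even slow things down. A minimal sketch of capping the number of concurrent copies with ParallelOptions - the cap of 4 here is an arbitrary assumption, not a recommendation:

var options = new ParallelOptions { MaxDegreeOfParallelism = 4 };
Parallel.ForEach(files, options, file => {
    var newLocation = CopyToNewLocation(file);
    results.Add(ExecuteAdditionalProcessing(newLocation));
});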
Damian
  • `Parallel.For` allows you to handle CPU-bound tasks faster, but it won't improve your IO-bound performance (in some cases it would even have the opposite effect) - http://stackoverflow.com/questions/868568/what-do-the-terms-cpu-bound-and-i-o-bound-mean. So do you really think the issue is CPU-bound and not that the enumeration of 30,000 files with glob matching is the bottleneck? I agree that processing files as soon as possible is the proper approach, but it seems that the OP has to have all those items **before** doing any actual processing. – Eugene Podskal Jan 18 '17 at 21:02

If PowerShell is an option (and I would recommend it), try this:

Get-ChildItem C:\folder | Where-Object { $_.LastWriteTime -gt (Get-Date).AddDays(-7) }

Get-Date returns today's date, so the above returns all files that were modified in the last 7 days. To match a single day, as in the question, you could compare the dates with -eq instead, e.g. $_.LastWriteTime.Date -eq $datePicked.Date (where $datePicked is the chosen date).

Nils