
I'm trying to calculate directory sizes in a way that divides the load so that the user can see counting progress. I thought a logical way to do this would be to first create the directory tree and then go over it, summing the lengths of all the files.

What strikes me as unexpected is that the bulk of the time (disk I/O) is spent creating the directory tree, while going over the FileInfo[] afterwards finishes nearly instantly with virtually no disk I/O.

I've tried both Directory.GetDirectories(), simply building a tree of strings holding the directory names, and using a DirectoryInfo object, and either way building the tree takes the bulk of the I/O time (reading the MFT, of course) compared to going over FileInfo.Length for the files in each directory.

I suppose there's no way to significantly reduce the I/O needed to build the tree; I'm mostly just wondering why this operation takes so much more time than going over the far more numerous files?

Also, can anyone recommend a non-recursive way to tally things up? It seems I need to split up the enumeration and balance it in order to make the size tallying more responsive. Spawning a thread for each subdirectory off the base and letting scheduler competition balance things out probably wouldn't be very good, would it?
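
For reference, here's a minimal sketch of the two-phase approach described above (the names are illustrative and not taken from the actual repository):

```
// Minimal sketch of the two-phase approach: phase 1 walks the directory
// tree, phase 2 sums FileInfo.Length per directory.
using System;
using System.Collections.Generic;
using System.IO;

class DirSizeSketch
{
    static void Main(string[] args)
    {
        var root = new DirectoryInfo(args[0]);

        // Phase 1: build the directory tree. This is where nearly all of
        // the disk I/O shows up (reading the MFT).
        var dirs = new List<DirectoryInfo>();
        Collect(root, dirs);

        // Phase 2: go over the FileInfo[] of each directory and sum the
        // lengths. This finishes almost instantly once phase 1 has run.
        long total = 0;
        foreach (var dir in dirs)
        {
            foreach (FileInfo file in dir.GetFiles())
                total += file.Length;
            Console.WriteLine("{0:N0} bytes so far...", total);
        }
    }

    static void Collect(DirectoryInfo dir, List<DirectoryInfo> list)
    {
        list.Add(dir);
        foreach (DirectoryInfo sub in dir.GetDirectories())
            Collect(sub, list);   // access-denied directories are not handled here
    }
}
```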

EDIT: Repository for this code

j.i.h.
  • I've also struggled with calculating directory size. I've done exactly what you've done: tried `FileInfo[]` and then `Directory.GetDirectories()`. But I still don't know of any better way. – Eric Robinson Jun 26 '12 at 17:57
  • You're saying that calling GetDirectories() takes a long time? I have not seen that, but then again, I've never done this with a large number of directories. Also, why would you care if it's recursive? This is a recursive task and you're never going to have so many nested directories that you'll blow the stack. – George Mauer Jun 26 '12 at 18:00
  • Refer to http://stackoverflow.com/questions/468119/whats-the-best-way-to-calculate-the-size-of-a-directory-in-net – Romil Kumar Jain Jun 26 '12 at 18:00
  • Side note: beware of hard/soft links and mount points; these can make the results wrong. Also have some sort of protection against infinite recursion, just in case. – Alexei Levenkov Jun 26 '12 at 18:14
  • Please see "[Stack Overflow does not allow tags in titles](http://meta.stackexchange.com/a/130208)". – John Saunders Jun 26 '12 at 18:47
  • @GeorgeMauer I just measured with my Windows directory (un-cached, I believe, given the time it took) and that was ~2:40 at roughly 750 KB/s (slow HD, I know), roughly 120 MB for 20,000 directories. It just seems very odd. Then (like usual) the `FileInfo[]` retrieval and calculation took around 10s. Also re: recursive, it's just that I want the work to be somewhat evenly distributed, whereas recursive functions (at least how I'm using them here) go depth-first (since they rely on waiting for the deepest nodes to return). – j.i.h. Jun 26 '12 at 20:33
  • @AlexeiLevenkov I don't think there's any way for Windows to end up with infinite directory loops, is there? Also, with links, the problem would be counting a single file's size two or more times? – j.i.h. Jun 26 '12 at 20:42
  • @j.i.h., I hope loops are not possible, but I'm not completely sure... Check whether the `Users` folder works fine, as it has the most links. Yes, with links sizes would be counted multiple times, and you may also end up doing extra work re-iterating the same folders (one way to skip such links is sketched just below). – Alexei Levenkov Jun 26 '12 at 20:50
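
One way to guard against the links and mount points mentioned in these comments (just a sketch, not code from the question) is to skip anything NTFS marks as a reparse point, since junctions, symbolic links and mount points all show up that way:

```
// Sketch: skipping reparse points avoids double-counting linked content
// and any chance of re-walking (or looping through) the same folders.
using System.IO;

static class ReparseCheck
{
    public static bool IsReparsePoint(FileSystemInfo info)
    {
        return (info.Attributes & FileAttributes.ReparsePoint) != 0;
    }
}

// Usage inside the tree walk from the question:
//     foreach (DirectoryInfo sub in dir.GetDirectories())
//         if (!ReparseCheck.IsReparsePoint(sub))
//             Collect(sub, list);
```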

1 Answer


You can use Parallel.ForEach to run the directory size calculation in parallel. Get the subdirectories with GetDirectories and run Parallel.ForEach over each node. Use a single variable to keep track of the size and display it to the user; each parallel calculation increments that same variable. If needed, use lock() to synchronize between the parallel executions.
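
A sketch of what that could look like (the names are illustrative; Interlocked.Add is used here in place of lock() for the shared total, and a progress display could simply poll the counter):

```
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

class ParallelDirSize
{
    // Shared running total; a progress display could poll this periodically.
    static long _totalBytes;

    static void Main(string[] args)
    {
        var root = new DirectoryInfo(args[0]);

        // One parallel branch per top-level subdirectory; each branch walks
        // only its own subtree, so the branches never visit the same folders.
        Parallel.ForEach(root.GetDirectories(), dir => AddTreeSize(dir));

        // Files sitting directly in the root.
        foreach (FileInfo file in root.GetFiles())
            Interlocked.Add(ref _totalBytes, file.Length);

        Console.WriteLine("Total: {0:N0} bytes", Interlocked.Read(ref _totalBytes));
    }

    static void AddTreeSize(DirectoryInfo dir)
    {
        foreach (FileInfo file in dir.GetFiles())
            Interlocked.Add(ref _totalBytes, file.Length);

        foreach (DirectoryInfo sub in dir.GetDirectories())
            AddTreeSize(sub);   // access-denied and reparse-point handling omitted
    }
}
```

Since each branch stays inside its own subtree, the running total is the only shared state, so an Interlocked increment (or the lock() mentioned above) is all the synchronization needed.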

loopedcode
  • You should code it so that only unrelated directories are parallelized; then there will be no reason to lock beyond that. Though with most disks I'm not sure what parallelizing will gain you; disk I/Os seem synchronous in nature. All that you can really make parallel is the actual addition of totals, which should be negligible. – George Mauer Jun 26 '12 at 18:02
  • @JasonMalinowski Really... I had no idea. Do most OSes know how to take advantage of that? I knew it was much faster due to the no-moving-parts bit; I had no idea it enabled parallelism too. – George Mauer Jun 26 '12 at 18:08
  • Good idea, I'd never heard of `Parallel.ForEach()`; it sounds very useful. – j.i.h. Jun 26 '12 at 20:39
  • @GeorgeMauer Even spinning disks support "native command queuing", which allows more than one I/O to be sent to the drive so it can choose how best to satisfy the requests. Also consider situations where the filesystem has already cached part of the directory structure (so enumerating is CPU-bound) and other parts aren't. I'm not saying that you'll see magic performance speedups due to parallelization, but there are many reasons why such a speedup could happen. – Jason Malinowski Jun 27 '12 at 01:55