15

I have a very big storage disk (16T). I want to run 'du' on it to figure out how much space each subdirectory takes. However, that takes a very long time. Luckily, I have at my disposal a cluster of computers. I could therefore run 'du' in parallel, with each job handling a separate subdirectory, and write a simple script that does that. Is there already such a thing, or must I write it myself?

R S

3 Answers

13

It is simple to do using GNU Parallel:

parallel du ::: */*
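
If you only want one summarized, human-readable total per subdirectory instead of du's full recursive output, a variant along these lines should work (du's -s/-h flags and parallel's -j flag are standard; the limit of 8 concurrent jobs is just an illustrative choice):

# at most 8 du jobs at a time, one total per second-level directory
parallel -j8 du -sh ::: */*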
Ole Tange
    If anyone is wondering what the magic `:::` incantation does, search for "::: arguments" in the documentation: https://www.gnu.org/software/parallel/man.html: "Use arguments from the command line as input source instead of stdin (standard input). Unlike other options for GNU parallel ::: is placed after the command and before the arguments." – Mihai Todor Nov 02 '18 at 09:41
    Spend 15 minutes reading chapter 1+2 if you want to learn more: https://doi.org/10.5281/zenodo.1146014 – Ole Tange Nov 02 '18 at 20:04
  • Oh, that's great! Thank you for sharing this book! :) – Mihai Todor Nov 03 '18 at 01:06
3

It is not clear from your question how your storage is designed (RAID array, NAS, NFS or something else).

But, almost regardless of the actual technology, running du in parallel may not be such a good idea after all - it is very likely to slow things down.

A disk array has limited IOPS capacity, and multiple du processes will all draw from that pool. Even worse, a single du often slows down other IO operations many times over, even when the du process itself does not consume much disk throughput.

By comparison, if you have just a single CPU, running a parallel make (make -j N) will slow down the build, because process switching has considerable overhead.

The same principle applies to disks, especially spinning disks. The only situation where you will gain a considerable speed increase is when you have N drives mounted at independent directories (something like /mnt/disk1, /mnt/disk2, ..., /mnt/diskN). In that case, you should run du as N parallel processes, one per disk, as in the sketch below.
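
A minimal sketch of that per-disk case (the /mnt/disk* mount points are the hypothetical example paths from above):

for d in /mnt/disk*; do
    du -sh "$d" &   # one summarized du per independently mounted disk
done
wait                # wait for all background du jobs to finish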

One common improvement to speed up du is to mount your disks with the noatime flag. Without it, a massive disk scan creates a lot of write activity to update access times. With noatime, that write activity is avoided and du runs much faster.
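
For example, an already-mounted filesystem can be switched over without unmounting (run as root; the path is just the hypothetical example from above):

# remount without access-time updates
mount -o remount,noatime /mnt/disk1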

mvp
  • This is my university's storage, so I'm not familiar with the details. However, since this is a big disk (or disks) whose purpose is to serve as the storage for a cluster (Condor in this case), I am assuming it is designed to support multiple, if not many, IO operations at once. – R S Jul 07 '14 at 08:27
  • How are your client computers using this storage? An NFS mount? If so, a parallel scan might work, because NFS has considerable network round-trip overhead. – mvp Jul 07 '14 at 08:30
  • Is there a way for me to check this myself (some command line to run)? – R S Jul 07 '14 at 08:31
  • Assuming that your client computers are Linux or other Unix-like systems, a simple check would be to use `mount` and `df` to see where and how the directory on that 16TB drive is mounted. – mvp Jul 07 '14 at 08:35
  • Yep: ... type nfs (rw,nosuid,relatime,vers=3,rsize=16384,wsize=16384,namlen=255,soft,proto=tcp,port=2049,timeo=25,retrans=3,sec=sys,local_lock=none,addr=x.x.x.x) – R S Jul 07 '14 at 08:41
  • You might have better luck if you could somehow get local access to that storage - NFS is notoriously slow in these situations. On my home server, I have a 12TB RAID array with 8TB used (4 million files), and a local single-threaded `du` over the whole array took just 12 minutes. – mvp Jul 07 '14 at 09:58
  • One more thought - many sites/servers have server-side scripts that automatically create **[ls-lR](http://www.fifi.org/doc/mirror/html/mirror-lslR.html)** files. If you have something like this present, all you need to do is analyze the `ls-lR` file - that should be a very easy and quick operation. – mvp Jul 07 '14 at 10:06
3

> Is there already such a thing or must I write it myself?

I wrote sn for myself, but you might appreciate it too.

sn p .

will give you the sizes of everything in the current directory. It runs in parallel and is faster than du on large directories.

  • Have you considered applying to Homebrew and adding your tool as an install recipe? – dimitarvp Sep 25 '19 at 20:28
  • Furthermore, executing `sn o -n30` puts a 123GB directory below a 251MB one. :( It seems the sorting does not respect the humanised format. – dimitarvp Sep 25 '19 at 20:35