
I have created a dataset of millions of images (>15M so far) for a machine-learning project, taking up over 500 GB of storage. I created them on my MacBook Pro but want to get them to our DGX-1 (GPU cluster) somehow. I thought it would be faster to copy them to a fast external SSD (2x NVMe in RAID 0), plug that drive directly into a local workstation, and copy the data to the network scratch disk. I'm not so sure anymore, as I've been cp-ing to the external drive for over 24 hours now.

I tried using the Finder GUI to copy at first (bad idea!). For a smaller dataset (2M images), I used 7-Zip to create a few archives. I'm now copying the files with cp in the macOS Terminal.

I tried "cp /path/to/dataset /path/to/external-ssd"

Finder was definitely not the best approach, as it took forever at the "Preparing to copy" stage.

Using 7-Zip to archive the dataset increased the "file" transfer speed, but it took over 4 days(!) to extract the files, and that was for a dataset an order of magnitude smaller.

Using the command-line cp started off quickly but seems to have slowed down. Activity Monitor says I'm getting 6-8K IOs on the disk. It's been 24 hours and it isn't quite halfway done.

Is there a better way to do this?

SciGuy
  • https://askubuntu.com/ or https://unix.stackexchange.com/ are better forums for this question. Stack Overflow is meant for programming questions, while Ask Ubuntu and Unix & Linux Stack Exchange cover general questions. For example, https://unix.stackexchange.com/questions/188285/how-to-copy-a-file-from-a-remote-server-to-a-local-machine. – tk421 May 20 '19 at 15:51
  • Thanks, will do. – SciGuy May 20 '19 at 17:33

1 Answer


rsync is the preferred tool for this kind of workload. It is used for both local and network copies.

Its main benefits are (excerpted from the manpage):

  • delta-transfer algorithm, which reduces the amount of data sent
  • if it is interrupted for any reason, you can restart it easily at very little cost; it can even pick up partway through a large file (see the example after this list)
  • options that control every aspect of its behavior and permit very flexible specification of the set of files to be copied.
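
For example, resuming after an interruption is just a matter of re-running the same command; with --partial, partially transferred files are kept, so a large file does not start over from zero. A minimal sketch, reusing the question's placeholder paths:

rsync -a --partial --progress /path/to/dataset /path/to/external-ssd

(Note: the rsync bundled with macOS is an old 2.6.x release; --partial and --progress work there, while newer options such as --info=progress2 need rsync ≥ 3.1, e.g. from Homebrew.)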

Rsync is widely used for backups and mirroring and as an improved copy command for everyday use.

Regarding command usage and syntax, for local transfers it is almost the same as cp:

rsync -az /path/to/dataset /path/to/external-ssd
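
Two caveats: -z compresses data in transit, which pays off over a slow network link but mostly burns CPU on a local copy of already-compressed image files, so dropping it may well be faster here. Also, since rsync runs over SSH, you could push the dataset straight to the cluster and skip the external drive entirely; in this sketch, user@dgx1 and /scratch are placeholders for whatever your DGX-1 actually uses:

rsync -a --partial --progress /path/to/dataset user@dgx1:/scratch/

Mind the trailing slash on the source path: /path/to/dataset copies the directory itself into the destination, while /path/to/dataset/ copies only its contents.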

Gonzalo Matheu
  • Can you point to any references where rsync outperforms cp on large datasets? I saw this: https://stackoverflow.com/questions/6339287/copy-or-rsync-command – SciGuy May 20 '19 at 15:34