
I am working on a Mac, using bash commands via Terminal.

I am running a DNA sequencer that generates ~3-5 million files over the course of 48 hours. For speed reasons these files are saved to the computer's SSD. I would like to use fswatch and rsync to monitor the directory and transfer the files to a server as they are being generated, to reduce the long post-sequencing transfer time.

Here is the command I have come up with.

fswatch -o ./ | (while read; do rsync -r -t /Source/Directory /Destination/Directory; done)

But I am worried that due to the large number of files (>3 million) and the large total size (>100 GB) these tools might struggle to keep up. Is there a better strategy?

Thanks for your help!

Paul

2 Answers


The command you have come up with might work, but it has some performance issues I would want to avoid:

  • the "fswatch" would generate output on each modification of the FS (ex. every file update.
  • the "rsync" would each time check recursively all possible changes in the directory and it's sub directories and files. ( not counting the actual data copy, only this operation takes long time once there are a large number of files&dirs in the source and destination)

This means that for each line output by fswatch, one rsync instance would be started, and each rsync run would take longer and longer as the number of files grows.
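If you did want to keep an event-driven pipeline, fswatch can at least coalesce bursts of events so that a single rsync runs per batch rather than per file. A minimal sketch, assuming fswatch's -o (one event count per batch) and -l/--latency options and the placeholder paths from the question:

# Coalesce filesystem events into batches (roughly one line per 60 s of
# activity) and run a single rsync per batch instead of one per event.
fswatch -o -l 60 /Source/Directory | while read -r count; do
    echo "$count events since last sync, running rsync..."
    rsync -a /Source/Directory/ /Destination/Directory/
done

Even with batching, though, each rsync still has to rescan the entire tree, which is why I would avoid the per-event trigger altogether and use the periodic approach below.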

48 hours is a lot of time, and copying ~100 GB doesn't take that long anyway (disk to disk is very fast, and over a gigabit network it is also fast).

Instead, I would propose running rsync -a --delete /source /destination at regular intervals (e.g. every 30 minutes) during the generation process, and once at the end to be sure nothing is missed. A short script could look like this:

#!/bin/bash
# Keep syncing while the process that generates files is still running.
# The "grep -v grep" filter prevents the grep command itself from matching
# in the ps output, which would otherwise keep the loop alive forever.
while ps -ef | grep -v grep | grep -q "process that generates files"; do
    echo "Running rsync..."
    rsync -a --delete /source /destination
    echo "...waiting 30 minutes"
    sleep 1800 # seconds
done
# One last pass after the generator has exited, to catch anything written
# since the previous sync.
echo "Running final rsync..."
rsync -a --delete /source /destination
echo "...done."

...just replace "process that generates files" with whatever the generating process looks like in the ps -ef output while it is running. Adjust the interval as you see fit; I assumed that ~2 GB of data are created every 30 minutes, which can be copied in a couple of minutes.
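For example (the process pattern and script name below are placeholders, not the actual sequencer software):

# See how the sequencer shows up in the process list; "sequencer" is a
# placeholder pattern, replace it with part of the real process name.
ps -ef | grep -v grep | grep -i sequencer

# Launch the sync loop in the background so it survives closing the
# Terminal window; the script name is hypothetical.
nohup ./periodic_rsync.sh > periodic_rsync.log 2>&1 &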

The script ensures that rsync doesn't run more often than it should, so it spends its time copying files instead of comparing the source and destination too often.

The option "-a" (archive) would imply the options you use and more (-rlptgoD), the "--delete" would remove any file that exists on "/destination" but doesn't exist on "/source" (handy in case of temporary files that were copied but not actually needed in the final structure).

czvtools

The filesystem limits are likely going to be a problem.

See this answer: How many files can I put in a directory?

In general, the more files in a directory, the slower the filesystem will perform.
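A quick way to check how the output is actually spread across directories is to count the files in each immediate subdirectory of the output folder; if any single directory accumulates hundreds of thousands of entries, that is where slowdowns will show up first (the path is a placeholder, adjust to your output folder):

# Count files per immediate subdirectory and list the largest ones first.
for d in /Source/Directory/*/; do
    printf '%10d  %s\n' "$(find "$d" -type f | wc -l)" "$d"
done | sort -rn | head -20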

Chris Rouffer
  • This seems to be accounted for by the Sequencing software which bins the output into different individual directories as the output is generated. For example the output folder selected by the user will be filled with hundreds of consecutively numbered directories each containing thousands of files. The total output still adds up to millions of files but they are broken up into directory sub-groups. – Paul May 09 '17 at 14:52
  • Assuming you are using HFS+, you can fit 4,294,967,295 files on the file system. Since rsync and fswatch will be running over the course of 48 hours, you should have no trouble. – Chris Rouffer May 09 '17 at 15:04