133

We need to transfer 15TB of data from one server to another as fast as we can. We're currently using rsync but we're only getting speeds of around 150Mb/s, when our network is capable of 900+Mb/s (tested with iperf). I've done tests of the disks, network, etc., and figured it's just that rsync only transfers one file at a time, which is causing the slowdown.

I found a script to run a different rsync for each folder in a directory tree (allowing you to limit to x number), but I can't get it working; it still just runs one rsync at a time.

I found the script here (copied below).

Our directory tree is like this:

/main
   - /files
      - /1
         - 343
            - 123.wav
            - 76.wav
         - 772
            - 122.wav
         - 55
            - 555.wav
            - 324.wav
            - 1209.wav
         - 43
            - 999.wav
            - 111.wav
            - 222.wav
      - /2
         - 346
            - 9993.wav
         - 4242
            - 827.wav
      - /3
         - 2545
            - 76.wav
            - 199.wav
            - 183.wav
         - 23
            - 33.wav
            - 876.wav
         - 4256
            - 998.wav
            - 1665.wav
            - 332.wav
            - 112.wav
            - 5584.wav

So what I'd like to happen is to create an rsync for each of the directories in /main/files, up to a maximum of, say, 5 at a time. So in this case, 3 rsyncs would run, for /main/files/1, /main/files/2 and /main/files/3.

I tried it like this, but it just runs 1 rsync at a time, for the /main/files/2 folder:

#!/bin/bash

# Define source, target, maxdepth and cd to source
source="/main/files"
target="/main/filesTest"
depth=1
cd "${source}"

# Set the maximum number of concurrent rsync threads
maxthreads=5
# How long to wait before checking the number of rsync threads again
sleeptime=5

# Find all folders in the source directory within the maxdepth level
find . -maxdepth ${depth} -type d | while read dir
do
    # Make sure to ignore the parent folder
    if [ `echo "${dir}" | awk -F'/' '{print NF}'` -gt ${depth} ]
    then
        # Strip leading dot slash
        subfolder=$(echo "${dir}" | sed 's@^\./@@g')
        if [ ! -d "${target}/${subfolder}" ]
        then
            # Create destination folder and set ownership and permissions to match source
            mkdir -p "${target}/${subfolder}"
            chown --reference="${source}/${subfolder}" "${target}/${subfolder}"
            chmod --reference="${source}/${subfolder}" "${target}/${subfolder}"
        fi
        # Make sure the number of rsync threads running is below the threshold
        while [ `ps -ef | grep -c [r]sync` -gt ${maxthreads} ]
        do
            echo "Sleeping ${sleeptime} seconds"
            sleep ${sleeptime}
        done
        # Run rsync in background for the current subfolder and move on to the next one
        nohup rsync -a "${source}/${subfolder}/" "${target}/${subfolder}/" </dev/null >/dev/null 2>&1 &
    fi
done

# Rsync any remaining files at or above the maxdepth level (not covered by the per-folder rsyncs above)
find . -maxdepth ${depth} -type f -print0 | rsync -a --files-from=- --from0 ./ "${target}/"
BT643

11 Answers

174

Updated answer (Jan 2020)

xargs is now the recommended tool for achieving parallel execution. It's pre-installed almost everywhere. For running multiple rsync tasks, the command would be:

ls /srv/mail | xargs -n1 -P4 -I% rsync -Pa % myserver.com:/srv/mail/

This will list all folders in /srv/mail, pipe them to xargs, which will read them one-by-one and run 4 rsync processes at a time. The % char replaces the input argument for each command call.
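
If the directory names can contain spaces or other unusual characters (see the comments below), a null-delimited variant is safer. This is just a sketch, assuming the same source and destination as above:

find /srv/mail -mindepth 1 -maxdepth 1 -type d -print0 | xargs -0 -P4 -I% rsync -Pa % myserver.com:/srv/mail/

Here find emits each first-level directory terminated by a NUL byte, and xargs -0 consumes them safely, still running 4 rsync processes at a time.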

Original answer using parallel:

ls /srv/mail | parallel -v -j8 rsync -raz --progress {} myserver.com:/srv/mail/{}
Manu
  • 19
    Note, if you customize your `ls` output through various means, such as the `LISTFLAGS ` variable or `DIR_COLORS` file, you may need to use `ls --indicator-style=none` to prevent `ls` from appending symbols to the path name (such as `*` for executable files). – chadrik Dec 10 '15 at 23:51
  • 4
    I found this worked much better if I used cd /sourcedir ; parallel -j8 -i rsync -aqH {} /destdir/{} -- * – Criggie Jul 06 '16 at 03:31
  • That's a placeholder for the filenames you get piped from the `ls` command before. `man parallel` should have more details. The `find` command uses the same I believe. – Manu Nov 08 '18 at 14:45
  • 2
    This is not an efficient solution, as shown here: https://unix.stackexchange.com/questions/189878/parallelise-rsync-using-gnu-parallel This solution will create one rsync call per file in the listing – Prometheus Nov 13 '18 at 08:25
  • Depends on where your bottleneck is. When I used this command, I was limited in the bandwidth-per-connection. So using more memory for extra rsync instances was ok. Your use case may be different. – Manu Nov 14 '18 at 14:46
  • 3
    This answer was very helpful! I suggest adding `--sshdelay 0.2` just before `rsync` to make sure you don't overload the sshd on the remote server. – pzelasko Jul 25 '19 at 12:09
  • just tried: ls /srv/mail | parallel -v -j8 rsync -raz --progress {} myserver.com:/srv/mail/{}, which will generate sub dir with same name when sync dirs. And remove the last '{}', it works as expect . – gzerone Dec 03 '20 at 09:00
  • what is use of -n (--max-args) arg here? – Adil Saju Feb 18 '21 at 10:21
  • I would say that `parallel` is still a *much* better program than `xargs`. It's rarely preinstalled but it's always one of the first things I grab on a new machine. – forresthopkinsa Mar 22 '21 at 17:23
  • @AdilSaju and @Prometheus The `-n` parameter of `xargs` limits how many arguments are given to each instance of `rsync`. Actually, it is redundant here, as using the -I option forces `xargs` into "one input line at a time" mode anyways. – Kai Petzke Jul 31 '21 at 10:23
  • What about hidden files , system files etc ? does ls cut it ? – Gediz GÜRSU Sep 10 '21 at 08:07
  • @ManuelRiel Great answer especially with the update! Just wanted to add my two cents of using this approach; if your directory sizes are skewed then this will suffer as one stream can take forever to complete. I solved it by recreating the dir tree on the remote first and then adjusting the streams based on skewness. – Karan Chopra Nov 28 '21 at 14:12
  • This will fail if you have spaces in your filenames: https://stackoverflow.com/questions/16758525/make-xargs-handle-filenames-that-contain-spaces – asmaier Sep 06 '22 at 15:58
  • I have to cd source dir and excute `ls /source/files/ --indicator-style=none | xargs -n1 -P4 -I% rsync -Pa % 192.168.3.200:/target/files/ ` and it work. – leonardosccd Oct 04 '22 at 04:15
  • Interesting. `xargs` seems to be faster than `parallel`, but have fewer safety and order checks. See: [How do xargs and gnu parallel differ when parallelizing code?](https://stackoverflow.com/a/62513178/4561887) – Gabriel Staples Jul 14 '23 at 01:11
55

Have you tried using rclone.org?

With rclone you could do something like

rclone copy "${source}/${subfolder}/" "${target}/${subfolder}/" --progress --multi-thread-streams=N

where --multi-thread-streams=N represents the number of threads you wish to spawn.
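
If the bottleneck is many small files rather than a few large ones, the number of concurrent file transfers usually matters more than per-file streams. A hedged example combining both knobs (the paths are placeholders, and a remote named "remote" would have to be configured in rclone first):

rclone copy /main/files remote:/main/filesTest --transfers=8 --multi-thread-streams=4 --progress

--transfers controls how many files are copied simultaneously, while --multi-thread-streams splits an individual large file into chunks transferred concurrently (see the comments below).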

dantebarba
  • 1
    Fatal error: unknown flag: --multi-thread-streams – Stepan Yakovenko Jan 14 '21 at 18:06
  • 3
    @StepanYakovenko I've tested the flag and it's working in version 1.55.1: `rclone copy killmouseaccel killmouseaccel2 --multi-thread-streams=4 --progress 2021/06/01 13:50:30 NOTICE: Config file not found - using defaults Transferred: 0 / 0 Bytes, -, 0 Bytes/s, ETA - Transferred: 1 / 1, 100% Elapsed time: 0.0s` – dantebarba Jun 01 '21 at 16:54
  • Same to @Han.Oliver: The flag is working as I've pointed out in my last comment – dantebarba Jun 01 '21 at 16:55
  • 13
    Best option. Just run with 32 streams and its almost 50x faster than copying using finder or rsync. – Haine Sep 15 '21 at 19:45
  • what's the difference between `--multi-thread-streams` and `--transfers` ? – soulmachine May 25 '22 at 01:04
  • 1
    @soulmachine the first one spawns threads based on download chunks whereas transfers refers to the maximum number of simultaneous downloads that rclone can perform. With multi-thread-streams you can download one file splitting it into chunks that are downloaded concurrently. But if you are only downloading one file the --transfers option won't make any difference. – dantebarba May 26 '22 at 21:16
  • Building on the previous comments, you'll want to play with the number of `--transfers` allowed (the default seems to be 4) to speed up large copy operations with many files. I found something like `10` to be more reasonable. – Nick Jul 20 '22 at 19:00
  • 2
    Today I Learnt about `rclone` - Thank you so much I think this will do what I need. I need to copy tons of small files like 2MB but chunk them in parallel because of my lovely internet connection works better with multiple upload sockets instead of one. Thanks! – Piotr Kula Aug 31 '22 at 20:46
41

rsync transfers files as fast as it can over the network. For example, try using it to copy one large file that doesn't exist at all on the destination. That speed is the maximum speed rsync can transfer data. Compare it with the speed of scp (for example). rsync is even slower at raw transfer when the destination file exists, because both sides have to have a two-way chat about what parts of the file have changed, but it pays for itself by identifying data that doesn't need to be transferred.
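
For instance, a rough single-file benchmark could look like this (the file and host names are placeholders):

time rsync -a /main/files/1/343/123.wav remote:/tmp/
time scp /main/files/1/343/123.wav remote:/tmp/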

A simpler way to run rsync in parallel would be to use parallel. The command below would run up to 5 rsyncs in parallel, each one copying one directory. Be aware that the bottleneck might not be your network, but the speed of your CPUs and disks, and running things in parallel just makes them all slower, not faster.

run_rsync() {
    # e.g. copies /main/files/blah to /main/filesTest/blah
    rsync -av "$1" "/main/filesTest/${1#/main/files/}"
}
export -f run_rsync
parallel -j5 run_rsync ::: /main/files/*
Stuart Caie
  • Doesn't seem I can get parallel on Ubuntu Server 12.04 with `apt-get install parallel`. Don't really want to start installing stuff manually just for this because it's very rarely going to be needed. I was just hoping for a quick script I could do it with. – BT643 Jun 05 '14 at 13:52
  • 5
    @BT643: Use `apt-get install moreutils` to install `parallel` – codersofthedark Dec 03 '14 at 19:50
  • @dragosrsupercool Thanks, will keep that in mind when I ever need to do anything like this in future :) – BT643 Dec 05 '14 at 11:06
  • 11
    While yes copying single files go "as fast as possible", many many many times there seem to be some kind of cap on a single pipe where simultaneous transfers do not appear to choke each others' bandwidth thus meaning parallel transfers are far more efficient and faster than single transfers. – EkriirkE Aug 28 '15 at 22:32
  • How to install parallel in Linux? – PKHunter Sep 21 '15 at 23:36
  • @PKHunter see @codesofthedark comment: `apt-get install moreutils` – Mark Nov 01 '16 at 20:32
  • could we modify this to use xargs -P? it's available by default usually. – eyeApps LLC Sep 24 '17 at 01:19
  • 1
    Given that the answer links to the website for GNU `parallel`, it should be noted that the `moreutils` package installs a different binary with the same name. Both will accept the arguments given in this answer, but the GNU version should be installed with `apt-get install parallel` if you are reading the GNU documentation. – sjy Feb 09 '19 at 10:33
  • In other words, parallelism is never a good option when there is no network involved (disk to disk within the same machine for example)? – Gaia Jul 15 '21 at 16:04
  • @Gaia depends on the hardware. do your own benchmarks. – Ярослав Рахматуллин Feb 04 '23 at 14:41
32

You can use xargs, which supports running many processes at a time. For your case it would be:

ls -1 /main/files | xargs -I {} -P 5 -n 1 rsync -avh /main/files/{} /main/filesTest/
nickgryg
13

There are a number of alternative tools and approaches for doing this listed around the web. For example:

  • The NCSA Blog has a description of using xargs and find to parallelize rsync without having to install any new software for most *nix systems (a quick sketch along those lines follows below).

  • And parsync provides a feature-rich Perl wrapper for parallel rsync.

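As a rough illustration of the find-plus-xargs idea from the first bullet (the paths and job count below are placeholders, not taken from the NCSA post, and /main/filesTest is assumed to already exist):

cd /main/files
find . -mindepth 1 -maxdepth 1 -type d -print0 | xargs -0 -P5 -I{} rsync -a {}/ /main/filesTest/{}/

Each first-level directory becomes its own rsync invocation, with at most 5 running at once.
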
rogerdpack
Bryan P
  • 17
    Please don't just post some tool or library as an answer. At least demonstrate [how it solves the problem](http://meta.stackoverflow.com/a/251605) in the answer itself. – Baum mit Augen Jul 31 '17 at 18:46
  • 2
    @i_m_mahii Stack Exchange should automatically keep a copy of linked pages. – Franck Dernoncourt Aug 12 '17 at 19:53
  • parsync is awesome – James Hirschorn Mar 10 '19 at 06:28
  • 21
    Contrary to what some others may say, proposing a solution that is merely tools does help some of us. The "conform or go away!" crowd apparently doesn't actually just want to help others. so thanks for your post on behalf of all those who just discovered those two packages today from your post, and those who realized that xarg and find (without those packages) could also do the trick. Post and let the voters do their bit and ignore the bitter "get off my site" guys who seem to wander around here from time to time "enforcing". – TheSatinKnight Jun 25 '19 at 20:01
  • 3
    Since many of us who are actually reading this particular post know what we're looking for already, and since the OP provided a detailed question, proposing an advanced use case here is appropriate. I don't want some generic example (as I shouldn't be copying and pasting it for my application anyway) as to how to use these tools; I'm going to read the docs and figure it out myself. Trust but verify. – nicorellius Aug 01 '19 at 15:44
  • Also disagree with the pedantic answer. Knowing which tools exist for this task solves 80% of the problem. – Gabriel Magana Jan 17 '22 at 02:56
9

I've developed a python package called: parallel_sync

https://pythonhosted.org/parallel_sync/pages/examples.html

Here is some sample code showing how to use it:

from parallel_sync import rsync
creds = {'user': 'myusername', 'key':'~/.ssh/id_rsa', 'host':'192.168.16.31'}
rsync.upload('/tmp/local_dir', '/tmp/remote_dir', creds=creds)

Parallelism is 10 by default; you can increase it:

from parallel_sync import rsync
creds = {'user': 'myusername', 'key':'~/.ssh/id_rsa', 'host':'192.168.16.31'}
rsync.upload('/tmp/local_dir', '/tmp/remote_dir', creds=creds, parallelism=20)

However, note that ssh typically has MaxSessions set to 10 by default, so to increase parallelism beyond 10 you'll have to modify your ssh settings.
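
If you do raise it, the relevant knob lives in the remote host's sshd configuration; a sketch of the kind of change involved (the value is just an example, and sshd must be reloaded afterwards):

# /etc/ssh/sshd_config on the remote host
MaxSessions 20

Depending on how the connections are opened, MaxStartups may also need a look.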

max
7

3 - 4 tricks for speeding up rsync.

1. Copying from/to local network: don't use ssh!

If you're copying from one server to another on a local network, there is no need to encrypt data during transfer!

By default, rsync uses ssh to transfer data over the network. To avoid this, you have to create an rsync server on the target host. You can run the daemon on demand with something like:

rsync --daemon --no-detach --config filename.conf

where a minimal configuration file could look like this (see man rsyncd.conf):

filename.conf

port = 12345
[data]
       path = /some/path
       use chroot = false

Then

rsync -ax rsync://remotehost:12345/data/. /path/to/target/.
rsync -ax /path/to/source/. rsync://remotehost:12345/data/.

1.1. Minimal rsyncd.conf for restricting connections.

Regarding jeremyjjbrown's comment about security, here is a minimal config sample using dedicated network interfaces:

Main public server:

eth0:  1.2.3.4/0          Public address Main
eth1:  192.168.123.45/30  Backup network

A /30 network can hold only two hosts.

┏━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━┯━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━┓
┃ Network base│192.168.123.44 │         #0│11000000 10101000 01111011 001011│00┃
┃ Mask        │255.255.255.252│/30        │11111111 11111111 11111111 111111│00┃
┃ Broadcast   │192.168.123.47 │         #3│11000000 10101000 01111011 001011│11┃
┃ Host/net    │2              │Class C    │                                 │  ┃
┠─────────────┼───────────────┼───────────┼─────────────────────────────────┼──┨
┃▸First host  │192.168.123.45 │         #1│11000000 10101000 01111011 001011│01┃
┃ Last host   │192.168.123.46 │         #2│11000000 10101000 01111011 001011│10┃
┗━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━┷━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━┛

Backup server:

eth0:  1.2.3.5/0          Public address Backup
eth1:  192.168.123.46/30  Backup network

cat >/etc/rsyncd.conf <<eof
address 192.168.123.46
[main]
    path    = /srv/backup/backup0
    comment = Backups
    read only       = false
    uid     = 0
    gid     = 0
eof

So rsync will listen only for connections coming to 192.168.123.46, i.e. the second network interface.

Then rsync is run from the main server:

rsync -zaSD --zc zstd --delete --numeric-ids /mnt/. rsync://192.168.123.46/main/.

Of course, adding a rule to your firewall wouldn't hurt either:

iptables -I INPUT -i eth0 -p tcp --dport 873 -j DROP

2. Using zstandard zstd for high speed compression

Zstandard can be up to 8x faster than the common gzip, so using this newer compression algorithm will significantly improve your transfer!

rsync -axz --zc=zstd rsync://remotehost:12345/data/. /path/to/target/.
rsync -axz --zc=zstd /path/to/source/. rsync://remotehost:12345/data/.

perhaps with some --exclude directives (see the bottom of this answer!).

3. Multiplexing rsync to reduce inactivity due to browse time

Two important remarks:

This kind of optimisation is about disk access and filesystem structure; it has nothing to do with the number of CPUs! So it can improve transfers even if your host has a single-core CPU. If you plan to use any parallelizer tool, tell it not to base the job count on the number of physical CPUs.

As the goal is to keep the bandwidth busy with data while other tasks browse the filesystem, the most suitable number of simultaneous processes depends on the number of small files present.

3.1 Script using wait -n -p PID:

Recent bash versions added a -p PID option to the wait builtin, which is just what's needed for this kind of job:

#!/bin/bash

maxProc=3
source=''
destination='rsync://remotehost:12345/data/'

declare -ai start elap results order
wait4oneTask() {
    local _i
    wait -np epid
    results[epid]=$?
    elap[epid]=" ${EPOCHREALTIME/.} - ${start[epid]} "
    unset "running[$epid]"
    while [ -v elap[${order[0]}] ];do
        _i=${order[0]}
        printf " - %(%a %d %T)T.%06.0f %-36s %4d %12d\n" "${start[_i]:0:-6}" \
               "${start[_i]: -6}" "${paths[_i]}" "${results[_i]}" "${elap[_i]}"
        order=(${order[@]:1})
    done
}
printf "   %-22s %-36s %4s %12s\n" Started Path Rslt 'microseconds'
for path; do
    rsync -axz --zc zstd "$source$path/." "$destination$path/." &
    lpid=$!
    paths[lpid]="$path" 
    start[lpid]=${EPOCHREALTIME/.}
    running[lpid]=''
    order+=($lpid)
    ((${#running[@]}>=maxProc)) && wait4oneTask
done
while ((${#running[@]})); do
    wait4oneTask
done

Output could look like:

myRsyncP.sh files/*/*
   Started                Path                                 Rslt microseconds
 - Fri 03 09:20:44.673637 files/1/343                             0      1186903
 - Fri 03 09:20:44.673914 files/1/43                              0      2276767
 - Fri 03 09:20:44.674147 files/1/55                              0      2172830
 - Fri 03 09:20:45.861041 files/1/772                             0      1279463
 - Fri 03 09:20:46.847241 files/2/346                             0      2363101
 - Fri 03 09:20:46.951192 files/2/4242                            0      2180573
 - Fri 03 09:20:47.140953 files/3/23                              0      1789049
 - Fri 03 09:20:48.930306 files/3/2545                            0      3259273
 - Fri 03 09:20:49.132076 files/3/4256                            0      2263019

Quick check:

printf "%'d\n" $(( 49132076 + 2263019 - 44673637)) \
    $((1186903+2276767+2172830+1279463+2363101+2180573+1789049+3259273+2263019))
6’721’458
18’770’978

So 6.72 seconds elapsed to process 18.77 seconds of work, using up to three subprocesses.

Note: you could use musec2str to improve the output, by replacing the first long printf line with:

        musec2str -v elapsed "${elap[i]}"
        printf " - %(%a %d %T)T.%06.0f %-36s %4d %12s\n" "${start[i]:0:-6}" \
               "${start[i]: -6}" "${paths[i]}" "${results[i]}" "$elapsed"
myRsyncP.sh files/*/*
   Started                Path                                 Rslt      Elapsed
 - Fri 03 09:27:33.463009 files/1/343                             0   18.249400"
 - Fri 03 09:27:33.463264 files/1/43                              0   18.153972"
 - Fri 03 09:27:33.463502 files/1/55                             93   10.104106"
 - Fri 03 09:27:43.567882 files/1/772                           122   14.748798"
 - Fri 03 09:27:51.617515 files/2/346                             0   19.286811"
 - Fri 03 09:27:51.715848 files/2/4242                            0    3.292849"
 - Fri 03 09:27:55.008983 files/3/23                              0    5.325229"
 - Fri 03 09:27:58.317356 files/3/2545                            0   10.141078"
 - Fri 03 09:28:00.334848 files/3/4256                            0   15.306145"

Going further: you could add an overall stats line with a few edits to this script:

#!/bin/bash

maxProc=3  source=''  destination='rsync://remotehost:12345/data/'

. musec2str.bash # See https://stackoverflow.com/a/72316403/1765658

declare -ai start elap results order
declare -i sumElap totElap

wait4oneTask() {
    wait -np epid
    results[epid]=$?
    local -i _i crtelap=" ${EPOCHREALTIME/.} - ${start[epid]} "
    elap[epid]=crtelap sumElap+=crtelap
    unset "running[$epid]"
    while [ -v elap[${order[0]}] ];do  # Print status lines in command order.
        _i=${order[0]}
        musec2str -v helap ${elap[_i]}
        printf " - %(%a %d %T)T.%06.f %-36s %4d %12s\n" "${start[_i]:0:-6}" \
               "${start[_i]: -6}" "${paths[_i]}" "${results[_i]}" "${helap}"
        order=(${order[@]:1})
    done
}
printf "   %-22s %-36s %4s %12s\n" Started Path Rslt 'microseconds'
for path;do
    rsync -axz --zc zstd "$source$path/." "$destination$path/." &
    lpid=$! paths[lpid]="$path" start[lpid]=${EPOCHREALTIME/.}
    running[lpid]='' order+=($lpid)
    ((${#running[@]}>=maxProc)) &&
        wait4oneTask
done
while ((${#running[@]})) ;do
    wait4oneTask
done

totElap=${EPOCHREALTIME/.}
for i in ${!start[@]};do  sortstart[${start[i]}]=$i;done
sortstartstr=${!sortstart[*]}
fstarted=${sortstartstr%% *}
totElap+=-fstarted
musec2str -v hTotElap $totElap
musec2str -v hSumElap $sumElap
printf " = %(%a %d %T)T.%06.0f %-41s %12s\n" "${fstarted:0:-6}" \
   "${fstarted: -6}" "Real: $hTotElap, Total:" "$hSumElap"

Could produce:

$ ./parallelRsync Data\ dirs-{1..4}/Sub\ dir{A..D}
   Started                Path                                 Rslt microseconds
 - Sat 10 16:57:46.188195 Data dirs-1/Sub dirA                    0     1.69131"
 - Sat 10 16:57:46.188337 Data dirs-1/Sub dirB                  116    2.256086"
 - Sat 10 16:57:46.188473 Data dirs-1/Sub dirC                    0      1.1722"
 - Sat 10 16:57:47.361047 Data dirs-1/Sub dirD                    0    2.222638"
 - Sat 10 16:57:47.880674 Data dirs-2/Sub dirA                    0    2.193557"
 - Sat 10 16:57:48.446484 Data dirs-2/Sub dirB                    0    1.615003"
 - Sat 10 16:57:49.584670 Data dirs-2/Sub dirC                    0    2.201602"
 - Sat 10 16:57:50.061832 Data dirs-2/Sub dirD                    0    2.176913"
 - Sat 10 16:57:50.075178 Data dirs-3/Sub dirA                    0    1.952396"
 - Sat 10 16:57:51.786967 Data dirs-3/Sub dirB                    0    1.123764"
 - Sat 10 16:57:52.028138 Data dirs-3/Sub dirC                    0    2.531878"
 - Sat 10 16:57:52.239866 Data dirs-3/Sub dirD                    0    2.297417"
 - Sat 10 16:57:52.911924 Data dirs-4/Sub dirA                   14    1.290787"
 - Sat 10 16:57:54.203172 Data dirs-4/Sub dirB                    0    2.236149"
 - Sat 10 16:57:54.537597 Data dirs-4/Sub dirC                   14    2.125793"
 - Sat 10 16:57:54.561454 Data dirs-4/Sub dirD                    0     2.49632"
 = Sat 10 16:57:46.188195 Real: 10.870221", Total:                    31.583813"

Fake rsync for testing this script

Note: For testing this, I've used a fake rsync:

## Fake rsync wait 1.0 - 2.99 seconds and return 0-255 ~ 1x/10
rsync() { sleep $((RANDOM%2+1)).$RANDOM;exit $(( RANDOM%10==3?RANDOM%128:0));}
export -f rsync

4. Important step to speed up the rsync process: avoid slowing it down!!

It's worth taking some time to properly configure what you transfer, so you avoid synchronizing useless data!!

Search the man page for exclude and/or include:

  --cvs-exclude, -C        auto-ignore files in the same way CVS does
  --exclude=PATTERN        exclude files matching PATTERN
  --exclude-from=FILE      read exclude patterns from FILE
  --include=PATTERN        don't exclude files matching PATTERN
  --include-from=FILE      read include patterns from FILE

For backing up user directories, I often use:

rsync -axz --delete --zc zstd --exclude .cache --exclude cache  source/. target/.

Read the FILTER RULES section of the man page carefully:

man -P'less +/^FILTER\ RULES' rsync

Conclusion:

Take the time to read the man pages!! man rsync and man rsyncd.conf!!

F. Hauri - Give Up GitHub
  • "If you're locally copying a server to another, there is no need to encrypt data during transfer!" That is a completely outdated attitude. – jeremyjjbrown May 11 '23 at 17:37
  • @jeremyjjbrown Yes, but no!! There is a lot of case - using dedicated physical network for sample - where avoiding encryption will **significantly** reduce footprint **and** speedup transfert! Of course, you have to rightly configure your network, your *`rsyncd.conf`* and maybe your firewall! But your comment are not absolutely right! – F. Hauri - Give Up GitHub Jun 19 '23 at 12:14
  • @jeremyjjbrown Answering your comment, I've edited my answer (today). Feel free to revise your vote! ;-) – F. Hauri - Give Up GitHub Jun 19 '23 at 12:44
  • @jeremyjjbrown Added a 4th important step to do for speeding up rsync! – F. Hauri - Give Up GitHub Jun 19 '23 at 14:18
6

The simplest I've found is using background jobs in the shell:

for d in /main/files/*; do
    rsync -a "$d" remote:/main/files/ &
done

Beware that it doesn't limit the number of jobs! If you're network-bound this is not really a problem, but if you're waiting for spinning rust it will be thrashing the disk.

You could add

while [ $(jobs | wc -l | xargs) -gt 10 ]; do sleep 1; done

inside the loop for a primitive form of job control.
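
Putting the two snippets together, a minimal sketch could look like this (the remote host and paths are placeholders):

#!/bin/bash
max_jobs=10
for d in /main/files/*; do
    # Block until fewer than max_jobs background transfers remain
    while [ "$(jobs -rp | wc -l)" -ge "$max_jobs" ]; do sleep 1; done
    rsync -a "$d" remote:/main/files/ &
done
wait   # let the remaining transfers finish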

sba
4

The shortest version I found is to use the --cat option of parallel like below. This version avoids using xargs, only relying on features of parallel:

cat files.txt | \
  parallel -n 500 --lb --pipe --cat rsync --files-from={} user@remote:/dir /dir -avPi

#### Arg explainer
# -n 500           :: split input into chunks of 500 entries
#
# --cat            :: create a tmp file referenced by {} containing the 500 
#                     entry content for each process
#
# user@remote:/dir :: the root relative to which entries in files.txt are considered
#
# /dir             :: local root relative to which files are copied

Sample content from files.txt:

/dir/file-1
/dir/subdir/file-2
....

Note that this doesn't use -j 50 for the job count; that didn't work on my end. Instead I've used -n 500 for the record count per job, chosen as a reasonable number given the total number of records.

Valer
1

I've found UDR/UDT to be an amazing tool. TL;DR: it's a UDT wrapper for rsync, utilizing multiple UDP connections rather than a single TCP connection.

References: https://udt.sourceforge.io/ & https://github.com/jaystevens/UDR#udr

If you use any RHEL distros, they've pre-compiled it for you... http://hgdownload.soe.ucsc.edu/admin/udr

The ONLY downside I've encountered is that you can't specify a different SSH port, so your remote server must use 22.

Anyway, after installing the rpm, it's literally as simple as:

udr rsync -aP user@IpOrFqdn:/source/files/* /dest/folder/

and your transfer speeds will increase drastically in most cases; depending on the server, I've easily seen a 10x increase in transfer speed.

Side note: if you choose to gzip everything first, make sure to use the --rsyncable arg so that rsync only updates what has changed.

lemonskunnk
-1

Using parallel rsync on a regular disk would only cause the processes to compete for I/O, turning what should be a sequential read into inefficient random reads. You could instead tar the directory into a stream, pull it over ssh from the destination server, and pipe the stream into tar to extract it.
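
A minimal sketch of that approach, run on the destination server (the host name and paths are placeholders):

ssh user@source-server 'tar -C /main/files -cf - .' | tar -C /main/filesTest -xf -

This streams the whole tree over a single connection, keeping the read on the source mostly sequential.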

  • 3
    the question does not invite an evaluation of a hypothetical scenario. the problem is stated as **saturating the available link** _given that there is_ or _presumably because there is enough_ IO capacity to do so. – Ярослав Рахматуллин Feb 03 '23 at 05:36