83

I'm using Ansible to copy a directory (900 files, 136MBytes) from one host to another:

---
- name: copy a directory
  copy: src={{some_directory}} dest={{remote_directory}}

This operation takes an incredible 17 minutes, while a simple scp -r <src> <dest> takes a mere 7 seconds.

I have tried the Accelerated mode, which according to the Ansible docs "can be anywhere from 2-6x faster than SSH with ControlPersist enabled, and 10x faster than paramiko", but to no avail.

dokaspar

6 Answers

118

TLDR: use synchronize instead of copy.

Here's the copy command I'm using:

- copy: src=testdata dest=/tmp/testdata/

As a guess, I assume the sync operations are slow. The files module documentation implies this too:

The "copy" module recursively copy facility does not scale to lots (>hundreds) of files. For alternative, see synchronize module, which is a wrapper around rsync.

Digging into the source shows each file is processed with SHA1, implemented using hashlib.sha1. A local test implies that only takes about 10 seconds for 900 files (that happen to take 400 MB of space).

So, the next avenue. The copy is handled with module_utils/basic.py's atomic_move method. I'm not sure if accelerated mode helps (it's a mostly-deprecated feature), but I tried pipelining, putting this in a local ansible.cfg:

[ssh_connection]
pipelining=True

It didn't appear to help; my sample took 24 minutes to run. There's obviously a loop that checks a file, uploads it, fixes permissions, then starts on the next file. That's a lot of commands, even if the SSH connection is left open. Reading between the lines it makes a little bit of sense: the actual file transfer can't be done via pipelining, I think.

So, following the hint to use the synchronize command:

- synchronize: src=testdata dest=/tmp/testdata/

That took 18 seconds, even with pipelining=False. Clearly, the synchronize command is the way to go in this case.

Keep in mind synchronize uses rsync, which by default compares mod-time and file size. If you want or need checksumming, add checksum=True to the command. Even with checksumming enabled the time didn't really change: still 15-18 seconds. I verified the checksum option was on by running ansible-playbook with -vvvv; it can be seen here:

ok: [testhost] => {"changed": false, "cmd": "rsync --delay-updates -FF --compress --checksum --archive --rsh 'ssh  -o StrictHostKeyChecking=no' --out-format='<<CHANGED>>%i %n%L' \"testdata\" \"user@testhost:/tmp/testdata/\"", "msg": "", "rc": 0, "stdout_lines": []}
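For reference, a sketch of the same task with checksumming turned on (same test paths as above, in the same key=value syntax; checksum is the synchronize module's boolean option):

- synchronize: src=testdata dest=/tmp/testdata/ checksum=True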
tedder42
  • Is there no way for the copy module to be faster? This seems like a bug in copy for it to be so slow? – Daniel Compton Sep 19 '16 at 21:15
  • Once you've switched to `synchronize` over `copy`, you'll need to specify `rsync_opts` if you use rsync/ssh with different ports/users/configs: https://hairycode.org/2016/02/22/using-a-custom-ssh-config-with-ansibles-synchronize-module/ – Micah Elliott Jan 20 '17 at 19:50
  • What if I want to copy a directory locally, i.e. use the `copy` module with `remote_src: yes`? It is likely that `synchronize` cannot be used in this situation. – kimamula Jul 06 '17 at 09:44
  • You deserve a drink mate, nice answer – Venkata S S K M Chaitanya Apr 25 '18 at 16:03
  • This is the way to go!! Reduced my time to send over my vim dotfiles and color schemes from 175 and 157 seconds to 0.19s and 0.17s (tested with profile_tasks callback). I can't believe how many *MINUTES* I've spent watching that thing until we implemented this. NOTE: It may be helpful to instruct a 'file' task to set the user and group permissions after the synchronize operation is done (user/group functionality is not useful in synchronize module). – mathewguest Sep 21 '19 at 09:36
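As the last comment suggests, ownership can be set with a follow-up file task; a sketch, with the path, user, and group as placeholders:

- name: fix ownership after synchronize
  file:
    path: /tmp/testdata     # placeholder: the synchronized directory
    state: directory
    owner: someuser         # placeholder user
    group: somegroup        # placeholder group
    recurse: yes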
17

synchronize can be difficult to configure in environments with become_user. For one-time deployments you can archive the source directory and copy it with the unarchive module:

- name: copy a directory
  unarchive:
    src: some_directory.tar.gz
    dest: "{{ remote_directory }}"
    creates: "{{ remote_directory }}/indicator_file"
void
  • And how do you archive a local directory? `archive` seems to support only remote folders. – Ivan Kleshnin Jun 18 '19 at 06:31
  • This answer is not suitable for keeping a remote directory in sync with an ever-changing local one. It assumes the local version is a kind of immutable image that needs to be deployed only once. In that case one can archive it with `tar -cvpzf`, put the resulting archive into the `files/` subfolder of a playbook and then use the `unarchive` module for deployment that is faster than the `scp` in the question. – void Jun 18 '19 at 18:57
  • I know, thanks. Syncing and immutable overrides are two different things and I happen to need the latter. For the interest of potential readers, I solved the problem with `archive` by using `delegate_to`. – Ivan Kleshnin Jun 19 '19 at 04:43
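A sketch of that delegate_to approach, assuming the archive is built on the control machine; the paths and variable names below are placeholders:

- name: archive the directory on the control machine
  archive:
    path: "{{ some_directory }}"          # placeholder variable from the question
    dest: /tmp/some_directory.tar.gz      # placeholder path
  delegate_to: localhost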
2

The best solution I have found is to just archive the folder (e.g. as a tar.gz) and use the unarchive module.

450 MB folder finished in 1 minute.


- unarchive:
    src: /home/user/folder1.tar.gz
    dest: /opt
hd1
Rinu K V
2

Here is the task, following the usual main.yml conventions:

- name: "Copy Files"
  synchronize:
    src: <source>
    dest: <destination>
    rsync_opts:
      - "--chmod=F755" # provide here give also permission
Yakir GIladi Edry
-2

Using Mitogen (https://github.com/mitogen-hq/mitogen/) can also help, although it doesn't support the most recent Ansible versions for now and has some compatibility issues. It is also a great option when hundreds of files need to be templated faster.
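Enabling it is a matter of pointing Ansible at the Mitogen strategy plugin in ansible.cfg; a sketch, assuming Mitogen has been unpacked at a placeholder path:

[defaults]
# the path below is a placeholder for wherever Mitogen is unpacked
strategy_plugins = /path/to/mitogen/ansible_mitogen/plugins/strategy
strategy = mitogen_linear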

fragpit
-5

While synchronize is preferable to copy in this case, it is backed by rsync, so rsync's drawbacks (a client-server architecture) remain as well: CPU and disk bottlenecks, slow in-file delta calculation for large files, etc. It sounds like speed is critical for you, so I would suggest looking for a solution based on a peer-to-peer architecture, which is fast and scales easily to many machines, e.g. something BitTorrent-based such as Resilio Connect.