
I am running a shell script on machineA that copies files from machineB and machineC to machineA.

If a file is not on machineB, then it is guaranteed to be on machineC. So I try to copy each file from machineB first; if it is not there, I fall back to machineC and copy it from there.

On machineB and machineC there is a folder from which I am supposed to copy the files:

/data/pe_t1_snapshot/20140317

I need to copy around 400 files to machineA from machineB and machineC. Each file is around 3.5 GB, the network is 10 Gigabit, and the traffic is encrypted and decrypted at both ends.

Earlier I was copying the files one by one to machineA, which is really slow: it takes around 3 hours. Is there a way to run 5 different threads, each handling one file at a time, so that only 5 background processes are ever running? I don't want to download all the files in parallel, since 400 parallel transfers would cause packet loss and angry network admins :)

Or should I split the big group of files into sets of five and download each set of five in parallel until all the files are done? A rough sketch of what I mean is just below.
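For illustration, here is a minimal, untested sketch of that batching idea in plain bash; `copy_one_file` is a hypothetical helper that would wrap the scp fallback logic from my script below:

batch=5
for ((i = 0; i < ${#PRIMARY_PARTITION[@]}; i += batch)); do
    # launch up to five copies in the background
    for el in "${PRIMARY_PARTITION[@]:i:batch}"; do
        copy_one_file "$el" &
    done
    wait   # block until this whole batch has finished
done

One drawback of plain batching is that each batch waits for its slowest file before the next batch can start.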

Below is my current shell script, which copies the files one by one to machineA from machineB and machineC.

#!/bin/bash

readonly PRIMARY=/export/home/david/dist/primary
readonly FILERS_LOCATION=(machineB machineC)
PRIMARY_PARTITION=(0 3 5 7 9 11 13 15 17 19 21 23 25 27 29) # in reality this array holds around 400 file numbers

dir1=/data/pe_t1_snapshot/20140317

# delete all the files first
find "$PRIMARY" -mindepth 1 -delete
for el in "${PRIMARY_PARTITION[@]}"
do
    # try machineB first; if that scp fails, fall back to machineC
    scp -o ControlMaster=auto -o 'ControlPath=~/.ssh/control-%r@%h:%p' -o ControlPersist=900 \
        david@"${FILERS_LOCATION[0]}":"$dir1"/s5_daily_1980_"$el"_200003_5.data "$PRIMARY"/. ||
    scp -o ControlMaster=auto -o 'ControlPath=~/.ssh/control-%r@%h:%p' -o ControlPersist=900 \
        david@"${FILERS_LOCATION[1]}":"$dir1"/s5_daily_1980_"$el"_200003_5.data "$PRIMARY"/.
done

Problem Statement:

I don't want to download ALL the files in parallel; I just want to limit the number of concurrent transfers to four or five. Our Unix admin suggested this approach and said it would help my transfer speed, but I am not sure how to enforce that limit in the shell script above, or how to split the big group of file numbers into sets of five and download each set in parallel.

Is this possible to do? If yes, can anyone provide an example?

john
  • The bottleneck will be either (a) write performance on machineA or (b) CPU on machines B/C due to `scp` (SSH) encryption. In both cases doing transfers in parallel will not help with total transfer duration (in fact, results will probably even be worse). – Sigi May 04 '14 at 18:13
  • If you still want to try: have a look at GNU `parallel` (especially the `--jobs` option), and investigate how to turn off SSH encryption (or how to use a very fast encryption algorithm; I actually don't know whether one exists). – Sigi May 04 '14 at 18:16
  • @Sigi: I understand your concern. In any case, our Unix admin suggested I try it this way, since they are throttling the network a little, and he said it would help my transfer speed a lot. To prove him wrong I need to run it, so that I can debate it with him. If possible, can you provide an example? – john May 04 '14 at 18:18
  • How can I make this parallel using GNU parallel with scp, downloading 5 files at a time? – john May 04 '14 at 18:20

1 Answer


Something like this:

do_copy() {
  el=$1
  # try machineB first; fall back to machineC if the first scp fails
  scp -o ControlMaster=auto -o 'ControlPath=~/.ssh/control-%r@%h:%p' -o ControlPersist=900 david@"$FILER1":"$dir1"/s5_daily_1980_"$el"_200003_5.data "$PRIMARY"/. ||
  scp -o ControlMaster=auto -o 'ControlPath=~/.ssh/control-%r@%h:%p' -o ControlPersist=900 david@"$FILER2":"$dir1"/s5_daily_1980_"$el"_200003_5.data "$PRIMARY"/.
}
export -f do_copy
# do_copy runs in a child shell, so everything it uses must be exported;
# bash cannot export arrays, hence the two scalar filer variables
export FILER1=${FILERS_LOCATION[0]} FILER2=${FILERS_LOCATION[1]} dir1 PRIMARY
parallel -j 5 do_copy ::: "${PRIMARY_PARTITION[@]}"
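
Here `-j 5` caps the number of simultaneous `do_copy` jobs at five: as soon as one transfer finishes, `parallel` starts the next file number from `PRIMARY_PARTITION`, so five transfers stay in flight until the list is exhausted. If SSH encryption turns out to be the CPU bottleneck (see the comments above), a cheaper cipher can be requested per connection, assuming both ends support it, by adding a `-c` option to each `scp` call, e.g.:

scp -c aes128-ctr -o ControlMaster=auto -o 'ControlPath=~/.ssh/control-%r@%h:%p' -o ControlPersist=900 david@"$FILER1":"$dir1"/s5_daily_1980_"$el"_200003_5.data "$PRIMARY"/.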

To learn more, see the GNU `parallel` manual (`man parallel`).

10-second installation:

(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
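
Alternatively, GNU parallel is packaged by most Linux distributions (as noted in the comments below), so instead of piping a script from the network you can install it from your package manager, for example:

sudo apt-get install parallel    # Debian/Ubuntu; the package name may differ on other distributions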
Ole Tange
  • There's no need to run an untrusted shell script for installing this. It's part of every Linux distribution. – Sigi May 04 '14 at 22:54
  • But some of them come with nasty surprises: http://stackoverflow.com/questions/16448887/gnu-parallel-not-working-at-all – Ole Tange May 04 '14 at 23:03
  • Some of those shell scripts people `wget` straight into their systems come with nasty surprises, too, I'm sure (not suggesting that yours does, mind you). – Sigi May 05 '14 at 02:50