14

I've got a whole heap of files on a server, and I want to upload them to S3. The files are stored with a .data extension, but really they're just a bunch of JPEGs, PNGs, ZIPs or PDFs.

I've already written a short script which finds the MIME type and uploads them to S3, and that works, but it's slow. Is there any way to make the script below run using GNU parallel?

#!/bin/bash

for n in $(find -name "*.data") 
do 
        data=".data" 
        extension=`file $n | cut -d ' ' -f2 | awk '{print tolower($0)}'` 
        mimetype=`file --mime-type $n | cut -d ' ' -f2`
        fullpath=`readlink -f $n`

        changed="${fullpath/.data/.$extension}"

        filePathWithExtensionChanged=${changed#*internal_data}

        s3upload="s3cmd put -m $mimetype --acl-public $fullpath s3://tff-xenforo-data"$filePathWithExtensionChanged     

        response=`$s3upload`
        echo $response 

done 

Also, I'm sure this code could be greatly improved in general :) Feedback and tips would be greatly appreciated.

Alan Hollis
  • Parallel upload possible with Python and boto – helloV Nov 14 '14 at 19:07
  • 1
    Nod I could have written something in go or another language, but I was trying to do it "all in bash".. for no particular reason. – Alan Hollis Nov 14 '14 at 19:09
  • [Possible solution here](http://blog.aclarke.eu/moving-copying-lots-of-s3-files-quickly-using-gnu-parallel/) – helloV Nov 14 '14 at 19:13
  • Aye :) I read that too, but what I couldn't work out was the important part of it: how I'd create the list of files to send into parallel. – Alan Hollis Nov 14 '14 at 19:14
  • 1
    Can't you redirect `find -name "*.data"` to a file and pass that file to parallel? `find -name "*.data" > mydata.txt` `parallel -j5 "doit {}" < mydatatxt` – helloV Nov 14 '14 at 21:33
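A minimal sketch of that suggestion, with a hypothetical `doit` standing in for whatever handles a single file (e.g. the upload logic from the question):

# Build the file list once, then let GNU parallel fan the work out
find . -name "*.data" > mydata.txt
# -j5 caps the number of simultaneous jobs at five; {} is replaced by one input line
parallel -j5 "doit {}" < mydata.txt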

4 Answers

13

You are clearly skilled in writing shell, and extremely close to a solution:

s3upload_single() {
    n=$1
    data=".data" 
    extension=`file $n | cut -d ' ' -f2 | awk '{print tolower($0)}'` 
    mimetype=`file --mime-type $n | cut -d ' ' -f2`
    fullpath=`readlink -f $n`

    changed="${fullpath/.data/.$extension}"

    filePathWithExtensionChanged=${changed#*internal_data}

    s3upload="s3cmd put -m $mimetype --acl-public $fullpath s3://tff-xenforo-data"$filePathWithExtensionChanged     

    response=`$s3upload`
    echo $response 
}
export -f s3upload_single
find -name "*.data" | parallel s3upload_single
Ole Tange
  • Awesome, thank you! If I'm reading this right, would this take all the files that `find -name "*.data"` returns and run every single one in parallel? If so that's really cool, but I'm assuming it would fall over if `find -name "*.data"` returned, say, 80k files – Alan Hollis Nov 15 '14 at 12:35
  • 1
    Almost: `parallel` defaults to one process per cpu core. If you want to run as many as possible use `parallel -j0`. This will still not run 80k in parallel, but stop spawning more when there are no more file handles or processes left. – Ole Tange Nov 15 '14 at 20:21
  • 1
    You can also use xargs for this. Just change the last line to `find -name '*.data" | xargs -n 1 -P 10 s3upload_single`. See this link: http://www.xaprb.com/blog/2009/05/01/an-easy-way-to-run-many-tasks-in-parallel/ – Jeff Wu Apr 06 '16 at 19:19
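One wrinkle with the xargs route: xargs cannot invoke a shell function directly, so the call has to go through `bash -c` (assuming the function has been exported with `export -f` as in the answer above). A sketch:

# -P 10 runs up to ten uploads at once, -n 1 passes one file per invocation;
# the child bash inherits the exported s3upload_single function
find . -name "*.data" -print0 |
    xargs -0 -n 1 -P 10 bash -c 's3upload_single "$1"' _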
3

You can just use s3cmd-modification, which allows you to put/get/sync with multiple workers in parallel:

$ git clone https://github.com/pcorliss/s3cmd-modification.git
$ cd s3cmd-modification
$ python setup.py install
$ s3cmd --parallel --workers=4 sync /source/path s3://target/path

-1

Use the AWS CLI. It supports parallel upload of files and it is really fast at both uploading and downloading.

http://docs.aws.amazon.com/cli/latest/reference/s3/
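
A minimal sketch of that approach, reusing the bucket name from the question; note the CLI guesses Content-Type from the file extension, so the per-file MIME detection from the original script would still need separate handling:

# sync transfers files concurrently by default; the concurrency is tunable with
# `aws configure set default.s3.max_concurrent_requests 20`
aws s3 sync /path/to/internal_data s3://tff-xenforo-data --acl public-read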

Hitul
-1

Try s3-cli: a command-line utility frontend to node-s3-client, inspired by s3cmd, which attempts to be a drop-in replacement.

Paraphrasing from https://erikzaadi.com/2015/04/27/s3cmd-is-dead-long-live-s3-cli/ :

This is an in-place replacement for s3cmd, written in Node (yay!), which works flawlessly with the existing s3cmd configuration and (among other awesome stuff) uploads to S3 in parallel, saving LOADS of time.

-        system "s3cmd sync --delete-removed . s3://yourbucket.com/"
+        system "s3-cli sync --delete-removed . s3://yourbucket.com/"
Motin