I timed thread pools (both a 4-thread pool and an 8-thread pool) against a straight serial shutil copy
and against an OS-level copy of the files (i.e., not in Python).
The target device was one of:
- a local spinning hard drive;
- a fast external SSD with a Thunderbolt 3 interface;
- an SMB network mount point backed by an SSD, over a 1000BASE-T interface.
The source device was a very fast Mac internal SSD capable of 8K video editing, and therefore much faster than any of the target devices.
First, create 100 files of random data, each between 1 MB and 100 MB:
#!/bin/bash
cd /tmp/test/src # a high bandwidth source SSD
for fn in {1..100}.tgt
do
    sz=$(( (1 + RANDOM % 100)*1000*1000 ))
    printf "creating %s with %s MB\n" "$fn" $((sz/(1000*1000) ))
    head -c "$sz" </dev/urandom >"$fn"
done
Now the timing code:
import shutil
import os
import pathlib
import concurrent.futures

def copy_file_to_directory(file, directory):
    '''
    Description:
        Copies the file to the supplied directory if it exists
    '''
    if os.path.isfile(file):
        url = os.path.join(directory, os.path.basename(file))
        try:
            shutil.copyfile(file, url)   # file contents
            shutil.copystat(file, url)   # timestamps and permission bits
            return True
        except IOError as e:
            print(e)
            return False
def f1(files, directory):
    '''
    Description:
        Copy a list of files to directory, overwriting existing files
        (thread pool with 4 workers)
    '''
    if not os.path.isdir(directory):
        os.makedirs(directory)
    if not os.path.isdir(directory):
        return False
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as e:
        for x in files:
            if os.path.isfile(x):
                e.submit(copy_file_to_directory, x, directory)
    return True

def f2(files, directory):
    '''
    Description:
        Copy a list of files to directory, overwriting existing files
        (thread pool with 8 workers)
    '''
    if not os.path.isdir(directory):
        os.makedirs(directory)
    if not os.path.isdir(directory):
        return False
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as e:
        for x in files:
            if os.path.isfile(x):
                e.submit(copy_file_to_directory, x, directory)
    return True
def f3(files, p):
    '''
    Serial file copy using copy_file_to_directory one file at a time
    '''
    for f in files:
        if os.path.isfile(f):
            copy_file_to_directory(f, p)
if __name__=='__main__':
    import timeit
    src='/tmp/test/src'
    cnt=0
    sz=0
    files=[]
    for fn in pathlib.Path(src).glob('*.tgt'):
        sz+=pathlib.Path(fn).stat().st_size
        cnt+=1
        files.append(fn)
    print('{:,.2f} MB in {} files'.format(sz/(1000**2),cnt))
    for case, tgt in (('Local spinning drive','/Volumes/LaCie 2TB Slim TB/test'),
                      ('local SSD','/Volumes/SSD TM/test'),
                      ('smb net drive','/Volumes/andrew/tgt-DELETE')):
        print("Case {}=> {}".format(case,tgt))
        for f in (f1,f2,f3):
            print(" {:^10s}{:.4f} secs".format(f.__name__, timeit.timeit("f(files, tgt)", setup="from __main__ import f, files, tgt", number=1)))
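As an aside, f1 and f2 differ only in max_workers, and neither looks at the futures it submits, so a failed copy is only visible as a printed message. A minimal sketch of a parameterized variant follows. The name copy_files_pooled and its return shape are hypothetical, and shutil.copy2 is swapped in for the copyfile plus copystat pair:

```python
import concurrent.futures
import os
import shutil


def copy_files_pooled(files, directory, max_workers=4):
    """Copy files into directory with a thread pool; return (copied, failed).

    Hypothetical helper, not part of the timed code.
    """
    os.makedirs(directory, exist_ok=True)
    copied, failed = [], []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its source path so failures can be reported.
        futures = {
            pool.submit(
                shutil.copy2, src, os.path.join(directory, os.path.basename(src))
            ): src
            for src in files
            if os.path.isfile(src)
        }
        for fut in concurrent.futures.as_completed(futures):
            src = futures[fut]
            try:
                fut.result()  # re-raises any exception from the worker thread
                copied.append(src)
            except OSError as exc:
                failed.append((src, exc))
    return copied, failed
```

The timings reported here are for f1, f2 and f3 as written, not for this helper.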
The results were:
4,740.00 MB in 100 files
Case Local spinning drive=> /Volumes/LaCie 2TB Slim TB/test
     f1    56.7113 secs
     f2    71.2465 secs
     f3    46.2672 secs
Case local SSD=> /Volumes/SSD TM/test
     f1    9.7915 secs
     f2    10.2333 secs
     f3    10.6059 secs
Case smb net drive=> /Volumes/andrew/tgt-DELETE
     f1    41.6251 secs
     f2    40.9873 secs
     f3    51.3326 secs
For comparison, here are the raw Unix cp times:
$ time cp /tmp/test/src/*.* "/Volumes/LaCie 2TB Slim TB/test"
real 0m41.127s
$ time cp /tmp/test/src/*.* "/Volumes/SSD TM/test"
real 0m9.766s
$ time cp /tmp/test/src/*.* "/Volumes/andrew/tgt-DELETE"
real 0m49.993s
As I suspected, the times (at least for MY tests) are all roughly the same, since the limiting factor is the underlying I/O bandwidth. Thread pools gave some advantage on the network device (about 41 s versus 51 s serial, roughly 20% faster), at the cost of a substantial disadvantage on the mechanical drive (57 s and 71 s versus 46 s serial, roughly 23% and 54% slower).
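To make the bandwidth argument concrete, the elapsed times can be converted to throughput. The snippet below is only arithmetic on the figures reported above; the dictionary layout and the throughput helper are illustrative, not part of the timing script:

```python
TOTAL_MB = 4740.00  # total size reported by the timing script

# Elapsed seconds copied from the results above ("cp" is the `real` time).
timings = {
    "Local spinning drive": {"f1": 56.7113, "f2": 71.2465, "f3": 46.2672, "cp": 41.127},
    "local SSD": {"f1": 9.7915, "f2": 10.2333, "f3": 10.6059, "cp": 9.766},
    "smb net drive": {"f1": 41.6251, "f2": 40.9873, "f3": 51.3326, "cp": 49.993},
}


def throughput(seconds, total_mb=TOTAL_MB):
    """Average transfer rate in MB/s for a run that took `seconds`."""
    return total_mb / seconds


for case, runs in timings.items():
    print(case)
    for name, secs in runs.items():
        print("  {:<3s} {:7.1f} MB/s".format(name, throughput(secs)))
```

On the SMB target, the pooled runs work out to roughly 114 to 116 MB/s, which is close to the 125 MB/s raw line rate of a 1000BASE-T link, so there is little headroom left for any copy strategy to exploit.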
These results apply only to copying files from one homogeneous location to another, with no processing of the individual files. If the steps involve CPU-intensive work on a per-file basis, or the individual files go to different I/O paths (i.e., one file to the SSD and, based on some condition, the next file to the network, etc.), that might favor a concurrent approach.
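The mixed-destination case could be sketched as follows. This is an untimed illustration under stated assumptions: copy_routed and the route callback are hypothetical names, and the design (one single-worker pool per destination) is just one way to keep writes to each device serial while letting different devices work in parallel:

```python
import concurrent.futures
import os
import shutil


def copy_routed(files, route):
    """Copy each file to the directory chosen by route(path).

    Hypothetical sketch: one single-worker pool per destination directory,
    so each device sees one write at a time while devices overlap.
    """
    pools = {}
    futures = []
    try:
        for src in files:
            dest_dir = route(src)
            if dest_dir not in pools:
                os.makedirs(dest_dir, exist_ok=True)
                pools[dest_dir] = concurrent.futures.ThreadPoolExecutor(max_workers=1)
            dst = os.path.join(dest_dir, os.path.basename(src))
            futures.append(pools[dest_dir].submit(shutil.copy2, src, dst))
        # result() blocks until the copy is done and re-raises worker exceptions.
        return [fut.result() for fut in futures]
    finally:
        for pool in pools.values():
            pool.shutdown(wait=True)
```

A route function might, for example, send files above some size to the network share and the rest to the SSD. Whether this beats a serial copy would need to be measured on the devices in question.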