
I'm trying to copy a large number of files from one directory to another. However, when I try to speed up the process by using threading, I get an error complaining about too many open files. The current test batch is around 700+ files, and the code is below. How do I fix this? In my example I'm copying files from one location on the network to another location on the same network, and the files range from 1 MB to 100 MB.

import os
import shutil
import threading

def copy_file_to_directory(file, directory):
    '''
    Description:
        Copies the file to the supplied directory if it exists
    '''
    if os.path.isfile(file):
        url = os.path.join(directory, os.path.basename(file))
        try:
            shutil.copyfile(file, url)
            shutil.copystat(file, url)
            return True
        except IOError as e:
            print(e)
            return False

def copy_files_to_directory(files, directory):
    '''
    Description:
        Copy a list of files to directory, overwriting existing files
    '''
    if not os.path.isdir(directory):
        os.makedirs(directory)

    if not os.path.isdir(directory):
        return False

    workers = []   
    for x in files:
        if os.path.isfile(x):
            worker = threading.Thread(target=copy_file_to_directory, args=(x, directory))
            worker.start()
            workers.append(worker)

    # wait until they are all done processing
    for x in workers:
        x.join()

    return True

files = [] # list of files
copy_files_to_directory(files, 'C:/Users/John')
JokerMartini
  • [This](https://stackoverflow.com/questions/18280612/ioerror-errno-24-too-many-open-files) may be helpful. – DYZ Aug 02 '18 at 22:20
  • Threading will not speed up this process. The CPU and your Python program are waiting for a disk that is orders of magnitude slower. Indeed, using parallel processes is likely to slow things down, since the disk performs more random access rather than serial reads. Just use [shutil](https://docs.python.org/3/library/shutil.html) and relax. That is likely as fast as it can happen... – dawg Aug 03 '18 at 02:34
  • @dawg sadly you are wrong in that regard. I initially tested this on almost 1,000 files and got around 8 minutes, whereas threading produced results of around 1 minute or less. – JokerMartini Aug 03 '18 at 19:40
  • @dawg That advice is a couple decades out of date. For example, many people have SSDs nowadays, which have instant seeking. In practice, issuing multiple reads and writes in parallel, as the OP is doing, often does speed things up, so instead of just assuming it won't, you really want to test. – abarnert Aug 04 '18 at 20:50
  • @abarnert: Please see the timings. The OP states *I'm copying files from one location on the network to another location on the same network*, so the bandwidth constraint is network speed. The same process applies. Just like 9 people cannot have a baby faster than one person can (i.e., the limiting speed is the 9-month human gestation period), 9 threads over ethernet are not 9 times faster than 1 direct copy over ethernet. – dawg Aug 05 '18 at 05:50
  • @JokerMartini: Please be sure you are actually seeing the results of copying the files vs some other OS function (decloning or the target's cache being two examples) that appears as if you have copied a new file but actually didn't. You state you got an 8x speedup from using threading. If that is from one homogeneous source to one homogeneous destination over one I/O pathway, that is not credible. – dawg Aug 05 '18 at 13:32

2 Answers

3

You almost certainly don't want to spawn a thread per file. To the extent that threading gives you a benefit (and you're not just saturating your disk I/O bandwidth anyway), you should probably use a thread pool (e.g. concurrent.futures.ThreadPoolExecutor) with a fixed number of threads. This will limit the number of files open at once. In fact, this case is given as an example in the Python docs: https://docs.python.org/dev/library/concurrent.futures.html#concurrent.futures.Executor.shutdown

Adapting this to your use:

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=4) as e:
    for x in files:
        if os.path.isfile(x):
            e.submit(copy_file_to_directory, x, directory)
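
If you also need to know which copies failed, you can keep the returned futures and check their results as they complete. A minimal sketch, assuming the copy_file_to_directory function from the question (which returns False when an IOError occurs):

import os
from concurrent.futures import ThreadPoolExecutor, as_completed

# Map each submitted future back to the source path it is copying.
with ThreadPoolExecutor(max_workers=4) as e:
    futures = {e.submit(copy_file_to_directory, x, directory): x
               for x in files if os.path.isfile(x)}
    for fut in as_completed(futures):
        if not fut.result():  # False means the copy hit an IOError
            print('failed to copy', futures[fut])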
Jeremy Roman
  • Great answer—but, for future reference, format code by indenting four spaces (or using the `{}` icon or Ctrl K), not with triple backticks. Otherwise all of your indentation gets lost (especially bad for Python code). – abarnert Aug 03 '18 at 02:27
  • Thanks for the formatting fix. It's been too long since I posted to SO. :P – Jeremy Roman Aug 04 '18 at 17:43
1

I timed thread pools (both a 4-thread pool and an 8-thread pool) against a straight serial shutil copy and against an OS copy of the files (i.e., not in Python).

The target device was one of:

  1. a local spinning hard drive;
  2. a fast external SSD with a Thunderbolt 3 interface;
  3. an SMB network mount point with an SSD on the remote device and a 1000BASE-T interface.

The source device was a very fast Mac internal SSD capable of 8K video editing, so it was much faster than any of the target devices.

First, create 100 random data files between 1 MB and 100 MB:

#!/bin/bash
cd /tmp/test/src   # a high bandwidth source SSD

for fn in {1..100}.tgt 
do 
   sz=$(( (1 + RANDOM % 100)*1000*1000 ))
   printf "creating %s with %s MB\n" "$fn" $((sz/(1000*1000) ))
   head -c "$sz" </dev/urandom >"$fn"
done

Now the timing code:

import shutil
import os
import pathlib
import concurrent.futures

def copy_file_to_directory(file, directory):
    '''
    Description:
        Copies the file to the supplied directory if it exists
    '''
    if os.path.isfile(file):
        url = os.path.join(directory, os.path.basename(file))
        try:
            shutil.copyfile(file, url)
            shutil.copystat(file, url)
            return True
        except IOError as e:
            print(e)
            return False

def f1(files, directory):
    '''
    Description:
        Copy a list of files to directory using a 4-thread pool, overwriting existing files
    '''

    if not os.path.isdir(directory):
        os.makedirs(directory)

    if not os.path.isdir(directory):
        return False

    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as e:
        for x in files:
            if os.path.isfile(x):
                e.submit(copy_file_to_directory, x, directory)
    return True     

def f2(files, directory):
    '''
    Description:
        Copy a list of files to directory using an 8-thread pool, overwriting existing files
    '''

    if not os.path.isdir(directory):
        os.makedirs(directory)

    if not os.path.isdir(directory):
        return False

    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as e:
        for x in files:
            if os.path.isfile(x):
                e.submit(copy_file_to_directory, x, directory)
    return True     

def f3(files, p):
    '''
    Serial file copy using copy_file_to_directory one file at a time
    '''
    for f in files:
        if os.path.isfile(f):
            copy_file_to_directory(f, p)

if __name__=='__main__':
    import timeit
    src='/tmp/test/src'
    cnt=0
    sz=0
    files=[]
    for fn in pathlib.Path(src).glob('*.tgt'):
        sz+=pathlib.Path(fn).stat().st_size
        cnt+=1
        files.append(fn)
    print('{:,.2f} MB in {} files'.format(sz/(1000**2),cnt))    

    for case, tgt in (('Local spinning drive','/Volumes/LaCie 2TB Slim TB/test'),('local SSD','/Volumes/SSD TM/test'),('smb net drive','/Volumes/andrew/tgt-DELETE')):  
        print("Case {}=> {}".format(case,tgt))
        for f in (f1,f2,f3):
            print("   {:^10s}{:.4f} secs".format(f.__name__, timeit.timeit("f(files, tgt)", setup="from __main__ import f, files, tgt", number=1)))  

The results were:

4,740.00 MB in 100 files
Case Local spinning drive=> /Volumes/LaCie 2TB Slim TB/test
       f1    56.7113 secs
       f2    71.2465 secs
       f3    46.2672 secs
Case local SSD=> /Volumes/SSD TM/test
       f1    9.7915 secs
       f2    10.2333 secs
       f3    10.6059 secs
Case smb net drive=> /Volumes/andrew/tgt-DELETE
       f1    41.6251 secs
       f2    40.9873 secs
       f3    51.3326 secs

And compare with raw unix copy times:

$ time cp /tmp/test/src/*.* "/Volumes/LaCie 2TB Slim TB/test"
real    0m41.127s

$ time cp /tmp/test/src/*.* "/Volumes/SSD TM/test"
real    0m9.766s

$ time cp /tmp/test/src/*.* "/Volumes/andrew/tgt-DELETE"
real    0m49.993s

As I suspected, the times (at least for MY tests) are all roughly the same, since the limiting speed is the underlying I/O bandwidth. There was some advantage to thread pools for the network device, with the tradeoff of a substantial disadvantage on the mechanical drive.

These results are only for copying from one homogeneous location of files to another homogeneous location with no processing of the individual files. If the steps involve some CPU-intensive work on a per-file basis, or if the destinations for the individual files involve different I/O paths (i.e., one file to the SSD and, based on some condition, the next file to the network, etc.), that might favor a concurrent approach.
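
For example, if each file needed a CPU-heavy step such as compression before landing at the destination, a process pool would let that per-file work run on several cores at once. A rough sketch under that assumption; the compress_and_copy helper and the f4 wrapper are hypothetical, with gzip standing in for whatever per-file processing is actually needed:

import concurrent.futures
import gzip
import os
import shutil

def compress_and_copy(file, directory):
    # Hypothetical CPU-bound step: gzip the file while writing it
    # into the destination directory.
    url = os.path.join(directory, os.path.basename(file) + '.gz')
    with open(file, 'rb') as src, gzip.open(url, 'wb') as dst:
        shutil.copyfileobj(src, dst)
    return url

def f4(files, directory):
    # A process pool lets the CPU-bound compression run on multiple cores.
    with concurrent.futures.ProcessPoolExecutor(max_workers=4) as e:
        for x in files:
            if os.path.isfile(x):
                e.submit(compress_and_copy, x, directory)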

dawg
  • Why are the unix copy times so much faster than using the shutil copy? In my case I have a list of file path strings I need to copy from one location to another as fast as I can. – JokerMartini Aug 06 '18 at 12:07
  • Well, the unix copy is not that much faster: 1.3 seconds faster out of 50 seconds for the network cp. Since the limiting speed is the speed of the network, all these approaches will be similar except coordinated compression and patching like `rsync`. Can you use `rsync`? Are the files compressible? That is as fast as it can be. The local files are compressed and sent to the `rsync` on the other side, which 'patches' the files on the other server. Since you said you are often copying over files, this might be orders of magnitude faster. This is how Dropbox works. – dawg Aug 06 '18 at 13:44
  • Let me try out the approaches above and see what results I get, and then I'll ping this thread based on that. I really appreciate your help with all of this. – JokerMartini Aug 07 '18 at 00:32
  • The code I wrote here for the timing used Jeremy Roman's code, btw. – dawg Aug 07 '18 at 01:26