1

This function copies files from one folder to another according to filetype. The problem is that when the number of files is very large, copying takes too long. Is there another way of doing it? Using another library/language/syntax?

import shutil
from os import listdir
from os.path import isfile, join

# raw_folder and jpg_folder are defined elsewhere as module-level globals
def main_copy(source, destination):

    # List of all files inside the source directory
    files_fullpath = [f for f in listdir(source)
                      if isfile(join(source, f))]

    # Copy files to the correct folder according to filetype
    if len(files_fullpath) != 0:
        for fs in files_fullpath:
            full_file = source + "\\" + fs
            if fs.endswith('.ARW'):
                shutil.copy(full_file, raw_folder + "\\" + fs)
            else:
                shutil.copy(full_file, jpg_folder + "\\" + fs)
        if len(listdir(destination)) != 0:
            print("Files moved successfully!")
    for starters, `if len(files_fullpath) != 0` seems unnecessary – timgeb Aug 24 '22 at 14:37
  • If you're on Linux/Unix you can use command line tools like `find` and `grep` to do this. Otherwise, time how long `listdir(source)` takes-- I'm guessing that's the bottleneck – Barry Carter Aug 24 '22 at 14:38
  • @BarryCarter I doubt the bottleneck is finding the files and not the actual process of copying itself. – matszwecja Aug 24 '22 at 14:38
  • @matszwecja Not if it's a large directory. If copying is the bottleneck, it's probably unresolvable – Barry Carter Aug 24 '22 at 14:42
  • There isn't really any problem algorithmically. If there are a lot of files to copy, long times are to be expected, and the best way to speed things up might be a change of hardware. I'd check whether copying the files using the OS takes a similar length of time - if so, there isn't much to be gained from optimising the code. – matszwecja Aug 24 '22 at 14:43
  • @BarryCarter RAW (`.arw`) files are image files that can be expected to have filesizes of 20-100MB each, so I doubt listdir execution time would be significant compared to time it takes to copy each file. – matszwecja Aug 24 '22 at 14:52
  • Fair enough.... – Barry Carter Aug 24 '22 at 14:54

2 Answers

0

A good way to speed up your procedure is to parallelize the copying, up to the point where disk write speed becomes the limit.

You could use the multiprocessing library and decide whether to use a thread pool (multiprocessing.pool.ThreadPool) or a process pool (multiprocessing.Pool).

I'll show you an example using multithreading, but multiprocessing might also be a good choice (depends on many factors).

import os
import glob
import shutil
from functools import partial
from multiprocessing.pool import ThreadPool

def multi_copy(source, destination, n=8):
    # copy_to_mydir will copy any file you give it to destination dir
    copy_to_mydir = partial(shutil.copy, dst=destination)

    # list of files we want to copy
    to_copy = glob.glob(os.path.join(source, '*'))

    with ThreadPool(n) as p:
        p.map(copy_to_mydir, to_copy)

Note: If you want to go deeper into multiprocessing vs multithreading, you can read this beautiful post.

Note 1: The bottleneck of this problem is likely the writes to disk. It depends on the machine you are running the code on: on some machines you will not get any benefit, because a single process can already saturate the disk's write capacity. On others, writing to disk in parallel can give a clear speedup.

Note 2: The greatest benefits from writing in parallel should be obtained when you have many small files.
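The thread-pool idea above can also be adapted to the question's specific case of routing `.ARW` files to one folder and everything else to another. This is only a sketch: `raw_folder` and `jpg_folder` here are parameters standing in for the destination directories the question assumes as globals.

```python
import os
import glob
import shutil
from multiprocessing.pool import ThreadPool

def copy_by_type(source, raw_folder, jpg_folder, n=8):
    """Copy files from source, routing .ARW files to raw_folder
    and all other files to jpg_folder, using a thread pool."""
    def copy_one(path):
        # pick the destination folder based on the file extension
        dest = raw_folder if path.endswith('.ARW') else jpg_folder
        shutil.copy(path, dest)

    # only plain files, not subdirectories
    files = [p for p in glob.glob(os.path.join(source, '*'))
             if os.path.isfile(p)]
    with ThreadPool(n) as pool:
        pool.map(copy_one, files)
```

Because the worker only calls `shutil.copy` (which releases the GIL during I/O), threads are enough here; a process pool would add pickling overhead for no benefit.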

Massifox
-1

You can try to use threading or multiprocessing to speed up copying. I would advise using the concurrent.futures module with either ThreadPoolExecutor or ProcessPoolExecutor.
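A minimal sketch of that approach, copying every file in a source directory with a ThreadPoolExecutor (the function name and the `max_workers` value are illustrative choices, not part of the question):

```python
import os
import shutil
from concurrent.futures import ThreadPoolExecutor

def copy_files_concurrently(source, destination, max_workers=8):
    """Copy every plain file in source to destination using a thread pool."""
    files = [os.path.join(source, f) for f in os.listdir(source)
             if os.path.isfile(os.path.join(source, f))]
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # submit one copy task per file; .result() re-raises any copy errors
        futures = [executor.submit(shutil.copy, f, destination) for f in files]
        for future in futures:
            future.result()
```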

SystemSigma_
  • If the delay really is file transfer speed (and I'm not saying it is), would that help? – Barry Carter Aug 24 '22 at 14:44
  • I don't think that would normally help if the R/W speed of the drive is the bottleneck. Threads aren't truly parallel because of the GIL in Python. If you want true parallelism you should use processes, which would also be limited by the drive's R/W speed – jack Aug 24 '22 at 14:48
  • Yes, if the bottleneck is R/W speed, it is impossible to speed this up. You cannot copy faster than your R/W speed. – SystemSigma_ Aug 24 '22 at 14:50