I am trying to write a multithreaded program in Python to speed up the copying of (under 1000) .csv files. The multithreaded code runs even slower than the sequential approach. I timed the code with profile.py. I am sure I must be doing something wrong, but I'm not sure what.
The Environment:
- Quad core CPU.
- 2 hard drives, one containing source files. The other is the destination.
- 1000 csv files ranging in size from several KB to 10 MB.
The Approach:
I put all the file paths in a Queue and create 4-8 worker threads that pull file paths from the queue and copy the designated file. In no case is the multithreaded code faster:
- sequential copy takes 150-160 seconds
- threaded copy takes over 230 seconds
I assume this is an I/O-bound task, so multithreading should speed up the operation.
The Code:
import Queue
import threading
import cStringIO
import os
import shutil
import timeit  # times code execution with gc disabled
import glob  # file wildcard lists, e.g. glob.glob('*.py')
import profile  # used below to time main()
fileQueue = Queue.Queue() # global
srcPath = 'C:\\temp'
destPath = 'D:\\temp'
tcnt = 0
ttotal = 0
def CopyWorker():
    while True:
        fileName = fileQueue.get()
        shutil.copy(fileName, destPath)
        # mark the item done only after the copy, so fileQueue.join() waits for it
        fileQueue.task_done()
        #tcnt += 1
        print 'copied: ', tcnt, ' of ', ttotal
def threadWorkerCopy(fileNameList):
    global ttotal  # update the module-level total that CopyWorker prints
    print 'threadWorkerCopy: ', len(fileNameList)
    ttotal = len(fileNameList)
    for i in range(4):
        t = threading.Thread(target=CopyWorker)
        t.daemon = True
        t.start()
    for fileName in fileNameList:
        fileQueue.put(fileName)
    fileQueue.join()
def sequentialCopy(fileNameList):
    #around 160.446 seconds, 152 seconds
    print 'sequentialCopy: ', len(fileNameList)
    cnt = 0
    ctotal = len(fileNameList)
    for fileName in fileNameList:
        shutil.copy(fileName, destPath)
        cnt += 1
        print 'copied: ', cnt, ' of ', ctotal
def main():
    print 'this is main method'
    fileCount = 0
    fileList = glob.glob(srcPath + '\\' + '*.csv')
    #sequentialCopy(fileList)
    threadWorkerCopy(fileList)

if __name__ == '__main__':
    profile.run('main()')
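
For comparison, here is a minimal sketch of how the two approaches could be timed side by side with plain wall-clock time instead of profile, since the profiler's instrumentation adds overhead of its own and may distort the numbers. The names copy_sequential, copy_threaded and timed are illustrative and not part of the code above, and the sketch is written to run on both Python 2 and Python 3. Each worker marks an item as done only after the copy has finished, and a None sentinel per worker tells the threads to stop.

```python
from __future__ import print_function

import shutil
import threading
import time

try:
    import queue           # Python 3
except ImportError:
    import Queue as queue  # Python 2


def copy_sequential(paths, dest_dir):
    """Copy each file in turn; return the number of files copied."""
    for path in paths:
        shutil.copy(path, dest_dir)
    return len(paths)


def copy_threaded(paths, dest_dir, num_workers=4):
    """Copy files with a small pool of worker threads; return the number copied."""
    work = queue.Queue()
    copied = []  # list.append is atomic in CPython, so no lock is used here

    def worker():
        while True:
            path = work.get()
            try:
                if path is None:  # sentinel: no more work for this thread
                    return
                shutil.copy(path, dest_dir)
                copied.append(path)
            finally:
                # runs after the copy (or for the sentinel), so join() is accurate
                work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for path in paths:
        work.put(path)
    for _ in threads:
        work.put(None)  # one sentinel per worker
    work.join()
    for t in threads:
        t.join()
    return len(copied)


def timed(func, *args):
    """Run func(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.time()
    result = func(*args)
    return result, time.time() - start
```

It could be used as `count, seconds = timed(copy_threaded, fileList, destPath)` and likewise for copy_sequential. Running each variant against a fresh destination directory would keep the comparison fair.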