
I am trying to insert some number (100) of data sets into SQL Server using Python. I am using multi-threading to create 100 threads in a loop. All of them start at the same time, and this bogs down the database. I want to group my threads into sets of 5, and once one group is done, I would like to start the next group of threads, and so on. As I am new to Python and multi-threading, any help would be highly appreciated. Please find my code below.

from threading import Thread

for row in datasets:
    argument1 = row[0]
    argument2 = row[1]
    jobs = []
    t = Thread(target=insertDataIntoSQLServer, args=(argument1, argument2))
    jobs.append(t)
    t.start()
for t in jobs:
    t.join()
user2639146

3 Answers


You can create a concurrent.futures.ThreadPoolExecutor and specify max_workers=5.

See the concurrent.futures documentation.

And you can use functools.partial to turn your functions into the required 0-argument functions.

EDIT: You can pass the args in along with the function when you submit to the executor. Thanks, Roland Smith, for reminding me that partial is a bad idea here; there was a better way.
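
For example, a minimal sketch building on the question's insertDataIntoSQLServer and datasets (both assumed to exist as in the question), with the arguments passed straight to submit:

from concurrent.futures import ThreadPoolExecutor

# max_workers=5 caps concurrency at five threads, as the question asks.
with ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(insertDataIntoSQLServer, row[0], row[1])
               for row in datasets]
    for future in futures:
        future.result()  # waits for completion and re-raises any worker exception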

Arya McCarthy
  • Thank you for your quick response. When I use ThreadPoolExecutor, do I need to join the thread pool? – user2639146 Apr 01 '17 at 22:07
  • You don't *need* to. If your usage requires it, it's possible: http://stackoverflow.com/questions/21143162/python-wait-on-all-of-concurrent-futures-threadpoolexecutors-futures – Arya McCarthy Apr 01 '17 at 22:15
  • The functions should take *one* argument. And note that `functools.partial` should only be used for arguments that are the same for every call! – Roland Smith Apr 01 '17 at 22:16

On Python 2 and 3 you can use a multiprocessing.pool.ThreadPool. This is like a multiprocessing.Pool, but it uses threads instead of processes.

from multiprocessing.pool import ThreadPool

datasets = [(1, 2, 3), (4, 5, 6)]  # Iterable of datasets.

def insertfn(data):
    pass  # shove data to the SQL server

pool = ThreadPool()

pool.map(insertfn, datasets)

By default, a Pool will create as many worker threads as your CPU has cores. Using more threads will probably not help, because they will be fighting for CPU time.

Note that I've grouped data into tuples. That is one way to get around the one argument restriction for pool workers.
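
Building on the names above, a minimal sketch (the worker body is only a placeholder) that caps the pool at the five threads the question asks for and unpacks each tuple inside the worker:

from multiprocessing.pool import ThreadPool

def insertfn(args):
    # pool.map passes each tuple as a single argument; unpack it here.
    argument1, argument2 = args
    # ... insert (argument1, argument2) into SQL Server here ...

datasets = [("a1", "b1"), ("a2", "b2")]  # dummy rows; one tuple per call

pool = ThreadPool(5)          # five worker threads instead of the CPU-count default
pool.map(insertfn, datasets)  # blocks until every row has been processed
pool.close()
pool.join()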

On Python 3 you can also use a ThreadPoolExecutor.

Note however that on Python implementations (like the "standard" CPython) that have a Global Interpreter Lock, only one thread at a time can be executing Python bytecode. So using large numbers of threads will not automatically increase performance. Threads might help with operations that are I/O bound. If one thread is waiting for I/O, another thread can run.

Roland Smith
  • Thanks for your response. I have tried using pool.map, but it accepts only one argument. I need to send multiple arguments to my insertfn. I haven't tried using functools.partial yet to turn my function into a zero-argument function. – user2639146 Apr 01 '17 at 22:11
  • What you could also do is group the arguments into a tuple. This tuple then counts as one argument. :-) You should use `functools.partial` only for arguments that are the same for every call of the function. – Roland Smith Apr 01 '17 at 22:13
  • `ThreadPoolExecutor` has been [back ported](https://pypi.python.org/pypi/futures) to Python 2.7. – Arya McCarthy Apr 01 '17 at 22:23

First note that your code doesn't work as you intended: it sets jobs to an empty list every time through the loop, so after the loop is over you only join() the last thread created.

So repair that, by moving jobs=[] out of the loop. After that, you can get exactly what you asked for by adding this after t.start():

if len(jobs) == 5:
    for t in jobs:
        t.join()
    jobs = []

I'd personally use some kind of pool (as other answers suggest), but it's easy to directly get what you had in mind.
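
Putting those two changes together, a minimal sketch of the repaired loop (still using the question's insertDataIntoSQLServer and datasets):

from threading import Thread

jobs = []                          # moved out of the loop
for row in datasets:
    t = Thread(target=insertDataIntoSQLServer, args=(row[0], row[1]))
    jobs.append(t)
    t.start()
    if len(jobs) == 5:             # a batch of five is running; wait for it
        for t in jobs:
            t.join()
        jobs = []

for t in jobs:                     # join whatever is left over from the last batch
    t.join()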

Tim Peters