
I am trying this code, and it works well, but it is really slow because the number of iterations is high.

I am thinking about threads, which should increase the performance of this script, right? Well, the question is: how can I change this code to work with synchronized threads?

def get_duplicated(self):
    db_pais_origuem = self.country_assoc(int(self.Pais_origem))
    db_pais_destino = self.country_assoc(int(self.Pais_destino))
    condicao = self.condition_assoc(int(self.Condicoes))

    origem = db_pais_origuem.query("xxx")
    destino = db_pais_destino.query("xxx")

    origem_result = origem.getresult()
    destino_result = destino.getresult()

    # O(n*m) nested loop: score every origem row against every destino row
    for i in origem_result:
        for a in destino_result:
            text1 = i[2]
            text2 = a[2]

            vector1 = self.text_to_vector(text1)
            vector2 = self.text_to_vector(text2)

            cosine = self.get_cosine(vector1, vector2)

origem_result and destino_result structure:

[(382360, 'name abcd', 'some data'), (361052, 'name abcd', 'some data'), (361088, 'name abcd', 'some data')]
  • multithreading right??? – K DawG Nov 18 '13 at 15:53
  • What does this script do? Where do the items you iterate over come from? –  Nov 18 '13 at 15:53
  • @LutzHorn The script will generate a cosine value. The data come from postgres. – daniel__ Nov 18 '13 at 15:56
  • @loops would you be kind enough to show a data sample for `origem_result` and `destino_result`? Otherwise you won't get an answer even if you offer a 500 bounty... – K DawG Nov 18 '13 at 15:57
  • @loops just for my curiosity, *how da heck do you convert text to a vector?* – K DawG Nov 18 '13 at 16:03
  • @KDawG http://stackoverflow.com/questions/15173225/how-to-calculate-cosine-similarity-given-2-sentence-strings-python – daniel__ Nov 18 '13 at 16:06
  • As long as the functions you're using are CPU-bound, threading **doesn't** improve performance at all. In fact, it makes it worse. What you want is multi-*processing*. Also see http://stackoverflow.com/questions/8430899/improving-python-execution-speed-with-parallel-threads. – jazzpi Nov 18 '13 at 16:42
  • @jazzpi Thanks for the info. So what I am looking for is multiprocessing. – daniel__ Nov 18 '13 at 16:45
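
Following up on jazzpi's comment, here is a minimal multiprocessing sketch of the question's nested loop. This is an illustration, not the accepted approach: `text_to_vector` and `get_cosine` stand in for the instance methods in the question, and both must be importable module-level functions for the worker processes to see them.

import itertools
from multiprocessing import Pool

def pair_cosine(pair):
    # each worker process vectorizes one (text1, text2) pair and scores it
    text1, text2 = pair
    return get_cosine(text_to_vector(text1), text_to_vector(text2))

# cross product of the two query results, keeping only the text column
pairs = [(i[2], a[2]) for i, a in itertools.product(origem_result, destino_result)]

pool = Pool()  # defaults to one worker process per CPU core
cosines = pool.map(pair_cosine, pairs)
pool.close()
pool.join()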

1 Answer


From what I can see, you are computing a distance function between pairs of vectors. Given a list of vectors v1, ..., vn and a second list w1, ..., wn, you want the distance/similarity between all pairs from v and w. This is usually highly amenable to parallel computation and is sometimes referred to as an embarrassingly parallel computation. IPython works very well for this.

If your distance function distance(a,b) is independent and does not depend on the results of other distance computations (this is usually the case that I have seen), then you can easily use the IPython parallel computing toolbox. I would recommend it over threads, queues, etc. for a wide variety of tasks, especially exploratory ones. However, the same principles could be extended to threads or the queue module in Python.

I recommend following along with http://ipython.org/ipython-doc/stable/parallel/parallel_intro.html#parallel-overview and http://ipython.org/ipython-doc/stable/parallel/parallel_task.html#quick-and-easy-parallelism, which provide a very easy, gentle introduction to parallelization.

In the simple case, you will simply use the cores on your computer (or machines on your network if you want a bigger speedup), and let each engine compute as many of the distance(a,b) values as it can.

Assuming a command prompt that can see the ipcluster executable, type

    ipcluster start -n 3

This starts the cluster with 3 engines. You will want to adjust the number of engines depending on your specific circumstances. Consider using n-1 cores, to leave one core free to handle the scheduling.

The hello world example goes as follows:

from IPython.parallel import Client

serial_result = map(lambda z: z**10, range(32))

rc = Client()
rc.ids        # the ids of the engines in the cluster
dview = rc[:] # use all engines

parallel_result = dview.map_sync(lambda z: z**10, range(32))
# a couple of caveats: this template will not work directly
# for our use case of computing distances over a matrix (observations x variables),
# because the allV data matrix and the distance function are not visible to the nodes

serial_result == parallel_result # True

For the sake of simplicity, I will show how to compute the distance between all pairs of the vectors specified in allV. Assume that each row represents a data point (observation) that has three dimensions.

Also, I am not going to present this the "pedagogically correct" way, but the way that I stumbled through it, wrestling with the visibility of my functions and data on the remote nodes. I found that to be the biggest hurdle to entry.

import numpy
import itertools

dataPoints = 10
allV = numpy.random.rand(dataPoints, 3)
mesh = list(itertools.product(numpy.arange(dataPoints), numpy.arange(dataPoints)))

# given the following distance function, we can evaluate locally
def DisALocal(a, b):
    return numpy.linalg.norm(a - b)

serial_result = map(lambda z: DisALocal(allV[z[0]], allV[z[1]]), mesh)

parallel_result = dview.map_sync(lambda z: DisALocal(allV[z[0]], allV[z[1]]), mesh)
# will not work, as DisALocal is not visible to the nodes
# will also not work, as allV is not visible to the nodes

There are a few ways to define remote functions, depending on whether we want to send our data matrix to the nodes or not. There are tradeoffs depending on how big the matrix is: you can send lots of vectors individually to the nodes, or send the entire matrix upfront.

# in the first case we send the function def to the nodes via the %autopx magic
%autopx
def DisARemote(a, b):
    import numpy
    return numpy.linalg.norm(a - b)
%autopx
# (the second %autopx toggles the magic off again)

# this requires us to push allV; also note the `import numpy` inside the function
dview.push(dict(allV=allV))
parallel_result = dview.map_sync(lambda z: DisARemote(allV[z[0]], allV[z[1]]), mesh)

serial_result == parallel_result

# here we generate the vectors to compute differences between,
# and pass only the vectors, so we do not need to load allV across the
# nodes. We must precompute the vectors, but this could, perhaps, be
# done more cleverly
z1, z2 = zip(*mesh)
z1 = numpy.array(z1)
z2 = numpy.array(z2)
allVectorsA = allV[z1]
allVectorsB = allV[z2]

@dview.parallel(block=True)
def DisB(a, b):
    import numpy  # needed on the nodes, as in DisARemote
    return numpy.linalg.norm(a - b)

parallel_result = DisB.map(allVectorsA, allVectorsB)
serial_result == parallel_result

In the final case we will do the following:

# this relies on the allV data matrix being preloaded on the nodes.
# note that with DisC we do not import numpy inside the function, but
# import it on the nodes via the sync_imports command
with dview.sync_imports():
    import numpy

@dview.parallel(block=True)
def DisC(a):
    return numpy.linalg.norm(allV[a[0]] - allV[a[1]])

# the data matrix must be pushed to all nodes
dview.push(dict(allV=allV))
parallel_result = DisC.map(mesh)

serial_result == parallel_result

All of the above can easily be extended to work in a load-balanced fashion.
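
As a sketch of that extension (assuming the Client `rc` from above is still connected; `load_balanced_view` is part of IPython.parallel, and the same visibility caveats apply, so allV must already be pushed and numpy imported on the nodes):

lview = rc.load_balanced_view()
# a load-balanced view hands each task to whichever node is free,
# rather than pre-partitioning the mesh across the nodes
parallel_result = lview.map(lambda z: numpy.linalg.norm(allV[z[0]] - allV[z[1]]),
                            mesh, block=True)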

Of course, the easiest speedup (assuming distance(a,b) = distance(b,a)) would be the following. It will only cut the run time in half, but it can be used with the above parallelization ideas to compute only the upper triangle of the distance matrix.

    for vIndex, currentV in enumerate(v):
        for wIndex, currentW in enumerate(w):
            if vIndex > wIndex:
                continue  # we can skip the other half of the computations
            distance[vIndex, wIndex] = get_cosine(currentV, currentW)
            # if distance(a,b) = distance(b,a) then use this trick
            distance[wIndex, vIndex] = distance[vIndex, wIndex]
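
To combine this trick with the parallel versions above, simply filter the index pairs before mapping (a small sketch using the `mesh` and `DisC` names from earlier):

    # keep only the upper triangle of index pairs, then map as before
    upper_mesh = [(i, j) for (i, j) in mesh if i <= j]
    upper_result = DisC.map(upper_mesh)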
Paul