
I was trying to follow the example from this question:

How to use threading in Python?

I have a sample dataframe (df) like this:

segment x_coord y_coord
a   1   1
a   2   4
a   1   7
b   2   3
b   4   3
b   8   3
c   4   4
c   2   5
c   7   8

and I am creating a k-d tree for each segment in a for loop, as below:

dist_name = df['segment'].unique()
tree = {}  # maps the loop index to that segment's tree
for i in range(len(dist_name)):
    a = df[df['segment'] == dist_name[i]]
    tree[i] = spatial.cKDTree(a[['x_coord', 'y_coord']])

How can I parallelize the tree creation using the sample cited in the link, shown below?

results = [] 
for url in urls:
  result = urllib2.urlopen(url)
  results.append(result)

Parallelized version:

pool = ThreadPool(4) 
results = pool.map(urllib2.urlopen, urls)

My attempt

import pandas as pd
import time
from scipy import spatial
import random
from multiprocessing.dummy import Pool as ThreadPool 


dist_name=['a','b','c','d','e','f','g','h']

# Build 1000 random points per segment, then stack them into one frame.
# (DataFrame.append was removed in pandas 2.0; pd.concat replaces it.)
frames = []
for name in dist_name:
    tmp = pd.DataFrame()
    tmp['x_coord'] = random.sample(range(1, 10000), 1000)
    tmp['y_coord'] = random.sample(range(1, 10000), 1000)
    tmp['segment'] = name
    frames.append(tmp)
df = pd.concat(frames, ignore_index=True)



start_time = time.time()
for i in range(len(dist_name)):
    a=df[df['segment']==dist_name[i]]
    tree = spatial.cKDTree(a[['x_coord','y_coord']])

print("--- %s seconds ---" % (time.time() - start_time))

--- 0.0312347412109375 seconds ---

def func(name):
    a = df[df['segment'] == name]
    return spatial.cKDTree(a[['x_coord','y_coord']])

pool = ThreadPool(4) 

start_time = time.time()
tree = pool.map(func, dist_name)
print("--- %s seconds ---" % (time.time() - start_time))

--- 0.031250953674316406 seconds ---

muni
  • Can you update your question with your attempt at applying the answer from your link to your code? You need to think about what function to write (e.g. `func()`) such that you can write `pool.map(func, dist_name)`. – quamrana Jun 05 '18 at 15:36
  • updated with my approach – muni Jun 05 '18 at 15:45

1 Answer

Your code:

dist_name = df['segment'].unique()
tree = {}  # maps the loop index to that segment's tree
for i in range(len(dist_name)):
    a = df[df['segment'] == dist_name[i]]
    tree[i] = spatial.cKDTree(a[['x_coord', 'y_coord']])

Needs to be transformed into:

dist_name=df['segment'].unique()

def func(name):
    a = df[df['segment'] == name]
    return spatial.cKDTree(a[['x_coord','y_coord']])

And your call to pool.map:

pool = ThreadPool(4) 
tree = pool.map(func, dist_name)
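
For reference, here is a minimal self-contained sketch of the same thread-pool pattern. The random data, the seed, and the helper name `build_tree` are assumptions made for illustration. It also swaps the repeated boolean filtering for `groupby`, so each segment's rows are selected once, and returns a dict keyed by segment name rather than a list.

```python
from multiprocessing.dummy import Pool as ThreadPool  # thread-based pool

import numpy as np
import pandas as pd
from scipy import spatial

# Illustrative data (an assumption): 3 segments with 1000 random points each.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'segment': np.repeat(['a', 'b', 'c'], 1000),
    'x_coord': rng.integers(1, 10000, size=3000),
    'y_coord': rng.integers(1, 10000, size=3000),
})


def build_tree(item):
    """Build a cKDTree from one (segment name, sub-frame) pair."""
    name, group = item
    return name, spatial.cKDTree(group[['x_coord', 'y_coord']].to_numpy())


# groupby yields (name, sub-frame) pairs, one per segment.
tasks = list(df.groupby('segment'))

with ThreadPool(4) as pool:
    trees = dict(pool.map(build_tree, tasks))
```

After this runs, `trees['a']` is the tree for segment `a`, and `trees['a'].query(...)` can be used as with any other `cKDTree`.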
quamrana
  • I am not observing any performance improvement with threading compared to the normal run. What might be the culprit here? – muni Jun 06 '18 at 14:16
  • I updated my question to show that there is no performance difference. I am not sure what is causing this behaviour. – muni Jun 06 '18 at 14:34
  • Why did you use `multiprocessing.dummy`? – quamrana Jun 06 '18 at 15:16
  • This is as per the link. – muni Jun 06 '18 at 15:20
  • The link also says that this uses the threading library under the covers. This will give no speed-up for CPU-bound tasks. Drop the `dummy` and you should be using multiprocessing. Beware that this may seem slower for a task that only takes 31 milliseconds anyway. You will also need the `if __name__ == '__main__':` idiom. – quamrana Jun 06 '18 at 15:51
  • You mean change the code to this: `if __name__ == '__main__': tree = pool.map(func, dist_name)`? – muni Jun 06 '18 at 16:19
  • Removing `dummy` makes it run forever. – muni Jun 06 '18 at 17:38
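
To make the last few comments concrete, here is a hedged sketch of how the process-based version might be laid out as a standalone script. The helper names `build_tree`, `make_frame`, and `main`, the random data, and the seed are all assumptions for illustration. The worker function sits at module level so that worker processes can import it, and the pool is only created under the `if __name__ == '__main__':` guard. On platforms that start workers by re-importing the main module (Windows, for example), a missing guard or a function defined only in an interactive session is a common reason for a process pool to hang, which may be what happened above.

```python
from multiprocessing import Pool  # process-based pool, no `dummy`

import numpy as np
import pandas as pd
from scipy import spatial


def build_tree(item):
    """Runs in a worker process: build a tree for one segment."""
    name, points = item
    return name, spatial.cKDTree(points)


def make_frame(names, n_points=1000, seed=0):
    """Illustrative data (an assumption): n_points random points per segment."""
    rng = np.random.default_rng(seed)
    total = len(names) * n_points
    return pd.DataFrame({
        'segment': np.repeat(names, n_points),
        'x_coord': rng.integers(1, 10000, size=total),
        'y_coord': rng.integers(1, 10000, size=total),
    })


def main():
    names = list('abcdefgh')
    df = make_frame(names)
    # Send each worker only the coordinates it needs, as a plain NumPy array.
    tasks = [(name, group[['x_coord', 'y_coord']].to_numpy())
             for name, group in df.groupby('segment')]
    with Pool(4) as pool:
        trees = dict(pool.map(build_tree, tasks))
    # Report how many points ended up in each segment's tree.
    print({name: tree.n for name, tree in trees.items()})


if __name__ == '__main__':
    main()
```

Note that each tree has to be pickled and sent back from the worker to the parent process, so for inputs as small as the 1000-point segments in the question the process version can easily be slower than the plain loop.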