I have a large data array (about a million rows) where each row consists of the coordinates of a quantity. I am using the KDTree algorithm from scipy.spatial to group these coordinates into predefined cells and append the cell number to the end of each row. I am using multiprocessing.Pool to process one slice of the array per core, so the number of slices equals the number of cores on the system. The code takes around a minute to process one array. Can I increase the speed with some modifications, or have I reached the limit? I have to process around 900 such arrays, so time is really of the essence here ;)

The relevant part of my code is as follows:

  import multiprocessing

  import numpy as np
  from scipy import spatial

  #function called by external code
  def AnodeAssignment(sim_data_dir,filename,energy_data_dir):
      argv = sim_data_dir + filename
      global data
      data = np.loadtxt(argv,str,usecols=(0,2,5,6,8,9)) #load the array from file
      good = np.where(data[:,2]=='phot') #keep only the 'phot' rows
      data = data[good]
      data = np.delete(data,2,1).astype(float)
      #slice the array based on the no. of cores; np.array_split also hands
      #out the leftover rows that plain integer division would silently drop
      data_slice = np.array_split(data,multiprocessing.cpu_count())
      #pool processes
      pool = multiprocessing.Pool()
      results = pool.map(pool_process,data_slice)
      pool.close()
      data = np.vstack(results) #combine the results

  #function to process each slice using the KDTree algorithm
  def pool_process(data_slice):
      #grid of predefined cell centres
      y = np.linspace(-19.5, 19.5, 14, endpoint=True)
      z = np.linspace(-9, 6, 6, endpoint=True)
      yv, zv = np.meshgrid(y,z)
      tree = spatial.KDTree(zip(yv.ravel(),zv.ravel()))
      #nearest cell centre for each (y,z) coordinate pair
      dist, indices = tree.query(data_slice[:,3:])
      zz = tree.data[indices][:,1]
      yy = tree.data[indices][:,0]

      #np.zeros rather than an uninitialised np.ndarray, so every row starts at 0
      anode = np.zeros((data_slice[:,0].size,1),int)

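      #map each nearest grid point to an anode number
      #(later assignments, e.g. the a8/a10 edge columns, override earlier ones)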
      a1  = np.where(np.logical_and(np.in1d(yy,y[1:-1:2]),np.in1d(zz ,[6])))
      a2  = np.where(np.logical_and(np.in1d(yy,y[2:-1:2]),np.in1d(zz ,[6])))
      a3  = np.where(np.logical_and(np.in1d(yy,y[1:-1:2]),np.in1d(zz ,[3])))
      a4  = np.where(np.logical_and(np.in1d(yy,y[2:-1:2]),np.in1d(zz ,[3])))
      a5  = np.where(zz==0)
      a6  = np.where(zz==-3)
      a7  = np.where(zz==-6)
      a8  = np.where(yy==-19.5)
      a9  = np.where(zz==-9)
      a10 = np.where(yy==19.5)

      anode[a1]  = 1
      anode[a2]  = 2
      anode[a3]  = 3
      anode[a4]  = 4
      anode[a5]  = 5
      anode[a6]  = 6
      anode[a7]  = 7
      anode[a8]  = 8
      anode[a9]  = 9
      anode[a10] = 10

      data_slice = np.hstack((data_slice,anode))
      return data_slice
Sujay Mate
  • What is the bottleneck? Also, is it Python 2 or 3? And if this is Python 2, have you considered using `itertools.izip` instead of `zip`? – Andrea Corbellini Dec 28 '15 at 09:40
  • I am not sure of the bottleneck... that's what I want to know, if anyone can point it out... Using Python 2.7... how is `itertools.izip` better than `zip`? – Sujay Mate Dec 28 '15 at 09:57
  • Try using [hotshot](https://docs.python.org/2/library/hotshot.html#example-usage) to find out the bottleneck. [This question](http://stackoverflow.com/questions/4989763/when-is-it-better-to-use-zip-instead-of-izip) is about `zip` vs `izip`. – Andrea Corbellini Dec 28 '15 at 10:12
  • But anyhow, I bet that the bottleneck is in the `spatial.KDTree` line (even though I'm not a numpy expert). Instead of learning how to use hotshot, try inserting two `print` statements: one above and one below the call to `KDTree` (for example: `print "start"; tree = ...; print "stop"`) and check if `KDTree` takes a minute to complete (a timed variant of this idea is sketched after the comments). – Andrea Corbellini Dec 28 '15 at 10:21
  • Okay, thanks... I'll try the print approach... – Sujay Mate Dec 28 '15 at 11:51
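
A minimal sketch of that timing check (an editorial aside, not from the thread), using `time.time()` in place of bare `print` statements; the grid is copied from the question, and random points stand in for the real `data_slice[:,3:]` coordinates:

    import time
    import numpy as np
    from scipy import spatial

    #the 14 x 6 grid of cell centres from the question
    y = np.linspace(-19.5, 19.5, 14, endpoint=True)
    z = np.linspace(-9, 6, 6, endpoint=True)
    yv, zv = np.meshgrid(y, z)

    points = np.random.uniform(-20, 20, (1000000, 2))  #stand-in for real data

    t0 = time.time()
    tree = spatial.KDTree(zip(yv.ravel(), zv.ravel()))
    t1 = time.time()
    dist, indices = tree.query(points)
    t2 = time.time()
    print "build: %.3f s  query: %.3f s" % (t1 - t0, t2 - t1)

If `query` dominates, then building the tree itself (only 84 points) is cheap and rebuilding it in every worker is not the problem.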

1 Answer


Using a Pool is a good step if the calculations can be done in parallel. An obvious improvement from the "brute force" category would be to borrow a machine with more cores. :-) Along the same lines, you might be able to use a cluster of machines.

Before you can optimize, you must measure. Use a profiler on your code to see where it is spending its time. Then see if you can do things differently to make it faster.
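
For instance, the standard-library `cProfile` module can profile a single call; a sketch, using the question's `AnodeAssignment` with placeholder arguments:

    import cProfile
    import pstats

    #profile one full call and save the statistics to a file
    cProfile.run("AnodeAssignment(sim_data_dir, filename, energy_data_dir)",
                 "anode.prof")

    #show the 10 most expensive calls, sorted by cumulative time
    pstats.Stats("anode.prof").sort_stats("cumulative").print_stats(10)

Note that `cProfile` only sees the parent process: time spent in the pool workers is lumped into the `pool.map` call, so it may be worth profiling `pool_process` on a single slice, without the Pool, as well.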

Having said that, the largest improvements are generally found by using another algorithm. But since I don't have a clue about what you are trying to accomplish here, I cannot offer concrete recommendations.
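
One generic "do it differently" change worth trying regardless (an editorial suggestion, not part of the original answer): `scipy.spatial.cKDTree` is a C implementation with the same construction and query interface as the pure-Python `KDTree` used in the question, and it is usually much faster on large query arrays. A minimal sketch, with random points standing in for the real coordinates:

    import numpy as np
    from scipy import spatial

    y = np.linspace(-19.5, 19.5, 14, endpoint=True)
    z = np.linspace(-9, 6, 6, endpoint=True)
    yv, zv = np.meshgrid(y, z)

    points = np.random.uniform(-20, 20, (1000000, 2))  #stand-in for data_slice[:,3:]

    #drop-in replacement: cKDTree instead of KDTree, same calls
    tree = spatial.cKDTree(zip(yv.ravel(), zv.ravel()))
    dist, indices = tree.query(points)

Whether this helps depends on where the time actually goes, which is what the profiling step above is for.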

Roland Smith