
I am computing some data for every year, and the computation is relatively intensive. I have used numba (to great effect) to reduce the time taken to run the iterations that compute the data. However, given that I have 20 years of independent data, I would like to split them into 4 groups of 5 that could run over 4 different CPU cores.

def compute_matrices(self):
    for year in self.years:
        self.xs[year].compute_matrix()

In the above code snippet, the function is a method of a class with attributes years and xs. Each year is simply an integer, and xs maps each year to a cross-section object that houses the xs.data attribute and the compute_matrix() method.

What is the easiest way to split this across multiple cores?

  1. It would be great if there were a Numba-style decorator that could automatically break up the loop, run the pieces over different processes, and glue the results together. Does this exist?

  2. Is my best bet Python's multiprocessing module?

sanguineturtle

2 Answers


So there are a couple of things you could look at for this:

NumbaPro: https://store.continuum.io/cshop/accelerate/. This is basically Numba on steroids, providing support for many- and multicore architectures. Unfortunately it is not cheap.

Numexpr: https://code.google.com/p/numexpr/ . This is an expression evaluator for NumPy arrays that evaluates expressions using multiple threads.
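As a rough illustration (a minimal sketch, assuming numexpr is installed), expressions are passed as strings and numexpr evaluates them across multiple threads:

```python
import numpy as np
import numexpr as ne

a = np.arange(1000000, dtype=np.float64)
b = np.arange(1000000, dtype=np.float64)

# numexpr compiles the string expression and splits the work across threads
result = ne.evaluate("2*a + 3*b")
```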

Numexpr-Numba (experimental): https://github.com/gdementen/numexpr-numba . As the name suggests this is Numexpr using a Numba backend.

A lot of the answer will depend on what is done in your compute_matrix method.

The fastest (in terms of development time) solution would probably be to just split your computations using the multiprocessing library. Note that multiprocessing will be easier to use if your compute_matrix method has no side effects.
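For example, a minimal sketch of that approach (the compute_matrix function here is a hypothetical stand-in for the real per-year computation; a top-level function is used so its arguments can be pickled):

```python
from multiprocessing import Pool

def compute_matrix(year):
    # Hypothetical stand-in for the real per-year computation
    return [[year + i + j for j in range(2)] for i in range(2)]

def compute_all(years, processes=4):
    # Each year is independent, so a process pool can simply map over them
    with Pool(processes=processes) as pool:
        results = pool.map(compute_matrix, years)
    return dict(zip(years, results))

if __name__ == "__main__":
    matrices = compute_all(range(1995, 2015))
    print(matrices[1995])  # [[1995, 1996], [1996, 1997]]
```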

ebarr
  • The compute_matrix() method has a bunch of code in it that doesn't work with `@autojit` due to heavy reliance on DataFrame objects etc. However, underneath the various steps I have turned the computational loops into separate routines with the `@autojit` decorator, which works really well. Now I would like to just farm out the year-by-year work in parallel. – sanguineturtle Apr 06 '14 at 22:59
  • To do this you will need to change a few things. The object being passed to a `Process` instance must be pickleable. See this question for some more info: http://stackoverflow.com/questions/1816958/cant-pickle-type-instancemethod-when-using-pythons-multiprocessing-pool-ma – ebarr Apr 06 '14 at 23:33
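The pickling requirement mentioned in the comments can be sketched as follows (CrossSection and compute are hypothetical names; in Python 2, a bound method such as xs.compute_matrix could not be pickled, so the portable pattern is to pass the instance to a top-level function):

```python
import pickle
from multiprocessing import Pool

class CrossSection(object):
    def __init__(self, data):
        self.data = data
    def compute_matrix(self):
        return [d * 2 for d in self.data]

def compute(xs):
    # Top-level function: always picklable, unlike a bound method in Python 2
    return xs.compute_matrix()

if __name__ == "__main__":
    xs_list = [CrossSection([1, 2]), CrossSection([3, 4])]
    # The instances themselves pickle fine as long as their attributes do
    assert pickle.loads(pickle.dumps(xs_list[0])).data == [1, 2]
    with Pool(2) as pool:
        print(pool.map(compute, xs_list))  # [[2, 4], [6, 8]]
```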

The easiest method I have come across for complex objects is to leverage the IPython Parallel Computing Engine.

Simply get an IPython cluster running with `ipcluster start -n 4`, or start one from the notebook interface.

Then you can map over the xs objects, with each assigned to a different engine.

def multicore_compute_matrices(self):
    from IPython.parallel import Client
    c = Client()
    # - Ordered List of xs Objects - #
    years = sorted(self.years)
    xs_list = [self.xs[year] for year in years]
    # - Compute across Clusters - #
    results = c[:].map_sync(lambda x: x.compute_matrix(), xs_list)
    # - Assign Results to Current Object - #
    for year, result in zip(years, results):
        self.xs[year].matrix = result

Wall time results from %time:

%time A.compute_matrices()
Wall time: 5.53 s

%time A.multicore_compute_matrices()
Wall time: 2.58 s
sanguineturtle
  • Is that a 2 times speed up with 4 cores? You may wish to investigate why the code doesn't scale more efficiently. – ebarr Apr 07 '14 at 03:34
  • It's a good point, one that I will look into further ... this method may have relatively high overheads. – sanguineturtle Apr 07 '14 at 04:10