
I am working on processing a dataset that includes dense GPS data. My goal is to use parallel processing to test my dataset against all possible distributions and return the best one with the parameters generated for said distribution.

Currently, I have code that does this in serial thanks to this answer https://stackoverflow.com/a/37616966. Of course, it is going to take entirely too long to process my full dataset. I have been playing around with multiprocessing, but can't seem to get it to work right. I want it to test multiple distributions in parallel, keeping track of sum of square error. Then I want to select the distribution with the lowest SSE and return its name along with the parameters generated for it.

def fit_dist(distribution, data=data, bins=200, ax=None):

    #Block of code that tests the distribution and generates params

    return(distribution.name, best_params, sse)

if __name__ == '__main__':

    p = Pool()

    result = p.map(fit_dist, DISTRIBUTIONS)

    p.close()
    p.join()

I need some help with how to actually make use of the return values from each iteration of the multiprocessing so I can compare them. I'm really new to Python, especially multiprocessing, so please be patient with me and explain as much as possible.

The problem I'm having is that I'm getting an `UnboundLocalError` on the variables I'm trying to return from my `fit_dist` function. The `DISTRIBUTIONS` list contains 89 objects. Could this be related to the parallel processing, or is it something to do with the definition of `fit_dist`?
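For anyone hitting the same error: the usual cause of an `UnboundLocalError` on a returned variable is that the assignment only happens on some code path, so the `return` statement can run before the name is ever bound. A minimal illustration (the `fit_dist_broken` function below is a made-up stand-in, not the asker's actual code):

```python
def fit_dist_broken(x):
    if x > 0:
        sse = x * x  # sse is only assigned on this branch
    return sse       # UnboundLocalError whenever x <= 0

print(fit_dist_broken(2))   # 4

try:
    fit_dist_broken(-1)
except UnboundLocalError as e:
    print("raised:", e)
```

Moving the `return` (or the assignment) into the correct block so the variable is always bound before it is used fixes this pattern.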

  • I don't know much about `scipy` but can tell you that you don't actually need the for loop since this is what the pool is for. The way `map` works is that it sends to the function (1st arg) each element from the iterable (2nd arg). So you could just do `result = p.map(fit_dist, DISTRIBUTIONS)`. Secondly, `map` returns a list with the results of all workers, so all your data will be in `result`. Hope that helps – Tomerikoo Jul 13 '19 at 00:58
  • @Tomerikoo Thank you so much! This fixed the issue with iterating through the `scipy` objects. – Logan Yoder Jul 13 '19 at 01:34
  • Happy to help! As to your update, I would suggest either editing the whole question to fit your new scenario (as the code now don't match), or opening a new question (and even closing this one...) – Tomerikoo Jul 13 '19 at 10:17
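Tomerikoo's description of how `map` works can be sketched with a toy worker (the `square` function here is a stand-in for `fit_dist`, not part of the original code): each element of the iterable is sent to the function, and the workers' results come back as one list, in the original order.

```python
from multiprocessing import Pool

def square(x):
    # stand-in worker: receives one element of the iterable per call
    return x * x

if __name__ == '__main__':
    with Pool() as p:
        # map() calls square() once per element and gathers
        # every return value into a single list, in order
        result = p.map(square, [1, 2, 3, 4])
    print(result)  # [1, 4, 9, 16]
```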

1 Answer


With the help of Tomerikoo's comment and some further struggling, I got the code working the way I wanted. The UnboundLocalError was due to me not putting the return statement in the correct block of code within my fit_dist function. To answer the question, I did the following:

from multiprocessing import Pool

def fit_dist(distribution, data=data, bins=200, ax=None):
    #put this return under the right section of this method
    return [distribution.name, params, sse]

if __name__ == '__main__':

    p = Pool()

    result = p.map(fit_dist, DISTRIBUTIONS)

    p.close()
    p.join()

    '''filter out the None object results. Due to the nature of the distribution fitting, 
    some distributions are so far off that they result in None objects'''
    res = list(filter(None, result))

    #iterates over the nested lists, keeping the lowest sum of squared errors in best_sse

    best_sse = float('inf')  #start above any real SSE so the first comparison succeeds
    for dist in res:
        if best_sse > dist[2] > 0:
            best_sse = dist[2]

    '''iterates over the list pulling out the sublist of the distribution with the best sse.
    Each sublist is made up of a string, a tuple with the parameters,
    and a float value for the sse, so the sse is always at index 2.'''

    for dist in res:
        if dist[2] == best_sse:
            best_dist_list = dist

The rest of the code simply consists of using that list to construct charts and plots of that best distribution on top of a histogram of my raw data.
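As a side note, the two selection loops above can be collapsed into a single `min()` call with a key function. A minimal sketch, using made-up result data in the same `[name, params, sse]` shape the answer returns:

```python
# hypothetical filtered results in the [name, params, sse] shape
res = [
    ['norm',  (0.0, 1.0),       12.5],
    ['gamma', (2.0, 0.0, 1.0),   4.2],
    ['expon', (0.0, 1.3),        9.8],
]

# min() with a key picks the sublist whose SSE (index 2) is smallest
best_dist_list = min(res, key=lambda d: d[2])
print(best_dist_list)  # ['gamma', (2.0, 0.0, 1.0), 4.2]
```

This does assume the None entries have already been filtered out, as in the answer's `res` list.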