
I'm processing a list of dictionaries in Python like so:

def process_results(list_of_dicts):
    first_result, second_result, count = [], [], 0
    for dictionary in list_of_dicts:
        first_result.append(dictionary)
        if 'pi' in dictionary:
            second_result.append(dictionary)
        count += 1
    print second_result, first_result

Next, following this simple SO example of using multiprocessing in a for loop, I'm trying the following (with completely erroneous results):

    from multiprocessing import Pool

    def process_results(list_of_dicts):
        first_result, second_result, count = [], [], 0
        for dictionary in list_of_dicts:
            first_result.append(dictionary)
            if 'pi' in dictionary:
                second_result.append(dictionary)
            count += 1
        return second_result, first_result

    if __name__ == '__main__':
        list_of_dictionaries = # a list of dictionaries
        pool = Pool()
        print pool.map(process_results, list_of_dictionaries)

Why is this wrong? An illustrative example would be nice.

Hassan Baig

1 Answer


What you're probably looking for is this:

from multiprocessing import Pool

def process_results(single_dict):
    # works on a single dictionary instead of the whole list
    first_result, second_result, count = [], [], 0
    first_result.append(single_dict)
    if 'pi' in single_dict:
        second_result.append(single_dict)
        count += 1
    return first_result, second_result

if __name__ == '__main__':
    lst_dict = [{'a': 1, 'b': 2, 'c': 3}, {'c': 4, 'pi': 3.14},
                {'pi': '3.14', 'not pi': 8.3143}, {'sin(pi)': 0, 'cos(pi)': 1}]
    pool = Pool()
    print pool.map(process_results, lst_dict)

pool.map executes process_results for each element in the iterable lst_dict. Since lst_dict is a list of dictionaries, process_results is called once per dictionary, with that single dictionary as its argument, so it processes one dictionary at a time rather than the whole list.
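Conceptually, leaving the parallelism aside, the call behaves like a plain loop over the list. A rough single-process equivalent, reusing the same names as above, would be:

results = [process_results(d) for d in lst_dict]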

process_results in this program is changed accordingly: for a given dictionary, it appends the dictionary to the first_result list and, if the 'pi' key exists, also appends it to the second_result list. The return value is a tuple of two lists: one containing the dictionary and one containing either a copy of the first or an empty list if no 'pi' key was found.
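For example, calling the function by hand on two of the sample dictionaries (a quick sketch, using the same process_results as above) gives:

process_results({'c': 4, 'pi': 3.14})      # -> ([{'c': 4, 'pi': 3.14}], [{'c': 4, 'pi': 3.14}])
process_results({'a': 1, 'b': 2, 'c': 3})  # -> ([{'a': 1, 'b': 2, 'c': 3}], [])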

All of this can be modified if, for instance, you need the first_result and second_result lists to be shared among processes; a rough sketch of that follows.
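One way to share the lists is multiprocessing.Manager. This is only a minimal sketch under that assumption; process_results_shared is a hypothetical variant, not part of the code above, and the append order across workers is not guaranteed:

from multiprocessing import Pool, Manager

def process_results_shared(args):
    # hypothetical variant: unpack one dictionary plus the two manager-backed lists
    single_dict, shared_first, shared_second = args
    shared_first.append(single_dict)
    if 'pi' in single_dict:
        shared_second.append(single_dict)

if __name__ == '__main__':
    manager = Manager()
    first_result = manager.list()    # proxies shared between worker processes
    second_result = manager.list()
    lst_dict = [{'a': 1, 'b': 2, 'c': 3}, {'c': 4, 'pi': 3.14}]
    pool = Pool()
    pool.map(process_results_shared, [(d, first_result, second_result) for d in lst_dict])
    print list(first_result), list(second_result)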

For a better picture of how pool.map() works look at the first example in the documentation.

To retrieve the results in their original/target form of two lists you can collect the data into a list and then process it:

results = pool.map(process_results, lst_dict)

# every input dictionary
first_result = [i[0][0] for i in results]
# only the dictionaries that contain a 'pi' key
second_result = [i[0][0] for i in results if i[1]]

results is a list of tuples, one per input dictionary. In each tuple the first element is a one-item list holding the dictionary, and the second is either an empty list or a list holding the same dictionary if the 'pi' key was found. The remaining two lines unpack that data into the first_result and second_result lists.
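With the sample lst_dict from the code above, that leaves:

first_result   # all four input dictionaries
second_result  # the two dictionaries that contain a 'pi' key: {'c': 4, 'pi': 3.14} and {'pi': '3.14', 'not pi': 8.3143}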

atru
  • Sorry I didn't define `list_of_dictionaries` in the code I wrote (over-simplification from my actual snippet). But I do have that (edited the question). Shouldn't it have worked? – Hassan Baig Oct 19 '17 at 20:42
  • That's ok - I made my own - `lst_dict` - wasn't it something like your `list_of_dictionaries` - I mean literally a list with dictionaries? – atru Oct 19 '17 at 20:43
  • Sorry, nevermind, I now understand how you absorbed the for loop and what's actually going on. – Hassan Baig Oct 19 '17 at 20:46
  • Good, if you specify your desired output I can add some lines. But I think you can deal with it easily now even without using shared variables. – atru Oct 19 '17 at 20:49
  • Well I need to end up with two lists of dictionaries. One containing the original input, the other solely containing dictionaries with `pi` key. – Hassan Baig Oct 19 '17 at 22:21
  • Updated the answer. That is one way to do it - without using shared variables which is an alternative. – atru Oct 20 '17 at 00:17
  • So ultimately when this thing went to production, my application server went out of memory and took down two redis instances with it. I've got it back online, but I'm still waiting for all memory taken by my app server to be released. Looking back, I wonder what went wrong in the approach. This is a busy server/web app with a ton of GET requests per second. Each GET request fetches a list of latest items from my redis instances. This is the process I applied multiprocessing to. I suppose I need to study its use cases much more closely. This was a rough first experience. – Hassan Baig Oct 20 '17 at 15:37
  • Sorry to hear that. You can try limiting the allowable number of threads. I'm not sure if that's going to help but possibly. To me that unspecified thread number option in `Pool` looks scary (and potentially inefficient, even slower than the single thread), especially for a system with large lists. You can do it by using `pool = Pool(n)`. Let me know what happened. – atru Oct 20 '17 at 15:44