
I am new to parallelizing and optimizing data-mining modules in Python, and I have a question about parallelizing the population of a dictionary. I am building an inverted index from values stored in a two-dimensional matrix `m`. The code works fine, but I would like to apply Python's `reduce` to make it run faster.

Here is my code:

def createInvertedIndex(matrix):
    # matrix[0] holds document IDs and matrix[1] holds tokens;
    # column i pairs document matrix[0][i] with token matrix[1][i].
    dic = {}
    for i in range(len(matrix[0])):
        if matrix[1][i] in dic:  # membership test directly on the dict
            dic[matrix[1][i]].append(matrix[0][i])
        else:
            dic[matrix[1][i]] = [matrix[0][i]]
    return dic
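
For example, with made-up document IDs and tokens, the input and output look like this:

# Hypothetical sample data: row 0 is document IDs, row 1 is tokens.
m = [["doc1", "doc2", "doc3", "doc4"],
     ["apple", "banana", "apple", "banana"]]

print(createInvertedIndex(m))
# {'apple': ['doc1', 'doc3'], 'banana': ['doc2', 'doc4']}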
  • 4
    Why would functools.reduce improve performance? – Kelly Bundy Mar 13 '20 at 18:11
  • Because I want to use it for multiprocessing. The original code is not taking advantage of multiprocessing. – user3089485 Mar 13 '20 at 18:18
  • 4
    What does it have to do with multiprocessing, though? I mean, sure, you can use it in multiprocessing, but you can also use a loop in multiprocessing, and probably the loop is going to be faster and more appropriate. – Kelly Bundy Mar 13 '20 at 18:20
  • Do you have a solution for running the above code faster in a multi-core environment? Can you elaborate your solution? – user3089485 Mar 13 '20 at 18:26
  • 6
    `functools.reduce` has nothing to do, implementation-wise, with MapReduce. It's not innately parallel, it's just a loop with incremental application of a function with the accumulated value and each new value from the input iterable. – ShadowRanger Mar 13 '20 at 18:31
  • @HeapOverflow I am trying to use the approach from https://lerner.co.il/2014/05/11/creating-python-dictionaries-reduce/ – user3089485 Mar 13 '20 at 18:36
  • Right, I was hoping to be able to use it with map_partition – user3089485 Mar 13 '20 at 18:37
  • 1
    I don't expect this code to be faster in parallel in python. With the `threading` module it will not be faster due to the GIL. With the `multiprocessing` module it will also probably not be faster due to the data transfer needed between processes. Why not using the `numpy` module here? It could give you a good speed-up compare to this dict-base pure-python algorithm. – Jérôme Richard Mar 13 '20 at 20:07
  • So the purpose here is to populate an inverted-index dictionary `dic` that maps each token to the list of documents containing it. Not sure how I can use `numpy` here. – user3089485 Mar 13 '20 at 20:46
  • 1
    If you want to know how to use `functools.reduce()` you should read the docs, and maybe a few tutorials. As others have already said, this has nothing to do with parallelism. I’m voting to close this question. Please see [tour], [ask], [help/on-topic]. – AMC Mar 14 '20 at 03:22
  • @user3089485 What you are trying to do looks like [this](https://stackoverflow.com/questions/38013778/is-there-any-numpy-group-by-function) where the groupby keys are in `matrix[1]` and the values are in `matrix[0]`. – Jérôme Richard Mar 14 '20 at 17:46
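
To make ShadowRanger's point in the comments concrete, here is a minimal sketch of the reduce-based construction in the style of the linked lerner.co.il post, adapted to this matrix layout (the helper name `add_pair` is invented for illustration). Note that `functools.reduce` still folds in one element at a time, sequentially:

from functools import reduce

def add_pair(dic, pair):
    # Fold a single (doc_id, token) pair into the accumulator dict.
    doc_id, token = pair
    dic.setdefault(token, []).append(doc_id)
    return dic

def createInvertedIndexReduce(matrix):
    # Equivalent to the loop in the question, expressed as a sequential
    # fold; there is no parallelism here.
    return reduce(add_pair, zip(matrix[0], matrix[1]), {})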
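For the multi-core question, the usual pattern is to split the columns into chunks, build a partial index per process, and merge the partial dicts at the end. A sketch under those assumptions (the worker count is arbitrary, and as the commenters warn, pickling the chunks and partial results often outweighs any savings):

from multiprocessing import Pool

def createInvertedIndexParallel(matrix, workers=4):
    n = len(matrix[0])
    step = max(1, (n + workers - 1) // workers)
    # Map step: one (docs, tokens) column slice per worker, reusing
    # createInvertedIndex from the question on each slice.
    chunks = [[matrix[0][i:i + step], matrix[1][i:i + step]]
              for i in range(0, n, step)]
    with Pool(workers) as pool:
        partials = pool.map(createInvertedIndex, chunks)
    # Reduce step: merge the partial indexes sequentially.
    merged = {}
    for part in partials:
        for token, doc_ids in part.items():
            merged.setdefault(token, []).extend(doc_ids)
    return merged

On platforms that spawn rather than fork (e.g. Windows), the call would need to sit under an `if __name__ == "__main__":` guard.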
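And to illustrate the numpy group-by recipe Jérôme Richard links above: sort both columns by token, then split the doc-ID column at each token boundary. A sketch assuming the columns can be converted to homogeneous numpy arrays:

import numpy as np

def createInvertedIndexNumpy(matrix):
    docs = np.asarray(matrix[0])
    tokens = np.asarray(matrix[1])
    # Stable sort by token keeps doc IDs in original order within a token,
    # matching the loop version.
    order = np.argsort(tokens, kind="stable")
    docs, tokens = docs[order], tokens[order]
    # First occurrence of each unique token marks a group boundary.
    keys, starts = np.unique(tokens, return_index=True)
    groups = np.split(docs, starts[1:])
    return {k: g.tolist() for k, g in zip(keys.tolist(), groups)}

This pushes the grouping work into vectorized sorting instead of a Python-level loop, which is where the speed-up mentioned in the comments would come from.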

0 Answers