Memory Error: When using itertools.permutations on large lists

Question

The code below creates a dictionary of permutations for each array in a list of arrays of customer ids/indices.

Test code can be run to generate a basic sample set:

import itertools

import numpy as np
import pandas as pd


def func(u_data):
    # Materialise every permutation of u_data as one DataFrame row.
    perm_ = pd.DataFrame(itertools.permutations(u_data))
    # Index by the last column, then convert to a nested dict.
    p_ = perm_.set_index(perm_.shape[1] - 1).to_dict()
    return p_


if __name__ == "__main__":
    cust_indices = [np.array([90, 91]), np.array([100, 101]), np.array([68, 69])]
    temp_indices = list(map(lambda i: func(i), cust_indices))

When the arrays in cust_indices contain more elements, the process is killed on AWS EC2 (i.e. it causes an OOM/memory error).

The code crashes at perm_ = pd.DataFrame(itertools.permutations(u_data)) when cust_indices =[np.array([90,91]),np.array([100,101]),np.array([68,69]),np.array([1234372, 1234373, 1234374, 1234375, 1234376, 1234377, 1234378,1234379, 1234380, 1234381, 1234382, 1234383, 1234384, 1234385])]
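
For scale, it may help to count what that line is being asked to hold. The sketch below uses only math.factorial. The helper names are hypothetical, and the 8 bytes per value is an assumption (one int64 per DataFrame cell) that ignores the intermediate Python tuples, so the real footprint would be higher.

```python
import math


def permutation_count(n_items):
    """Number of full-length permutations of n_items distinct elements."""
    return math.factorial(n_items)


def estimated_bytes(n_items, bytes_per_value=8):
    """Rough lower bound for an n! x n table, assuming 8 bytes per cell."""
    return permutation_count(n_items) * n_items * bytes_per_value


for n in (2, 10, 14):
    rows = permutation_count(n)
    gigabytes = estimated_bytes(n) / 1e9
    print(f"{n} items: {rows} rows, about {gigabytes:.2f} GB")
```

The 14-element array gives 14! = 87,178,291,200 rows, which is roughly 9.8 TB under that assumption, so the table cannot fit in memory on any ordinary instance.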

I am trying to optimize the code to cater for a larger dataset and prevent the OOM error, either by using multiprocessing or by changing the line perm_ = pd.DataFrame(itertools.permutations(u_data)).
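
One possible direction for changing that line is to consume the permutations one at a time instead of loading them all into a DataFrame. This is a minimal sketch, assuming the goal is the same nested dictionary that func returns ({column: {last element: value}}) and assuming that to_dict() keeps the last row seen for a repeated index value. The name func_lazy is hypothetical. It keeps memory small, but it still loops over all n! permutations, so it does not make a 14-element array feasible in time.

```python
import itertools


def func_lazy(u_data):
    """Build {column: {last_element: value}} without storing all permutations."""
    items = list(u_data)
    if not items:
        return {}
    n_cols = len(items) - 1  # the last position becomes the key, not a column
    result = {col: {} for col in range(n_cols)}
    for perm in itertools.permutations(items):
        key = perm[-1]
        for col in range(n_cols):
            # Later permutations overwrite earlier ones for the same key.
            result[col][key] = perm[col]
    return result


print(func_lazy([90, 91]))  # {0: {91: 90, 90: 91}}
```

The result holds at most n * (n - 1) entries, whatever the number of permutations, which is why the intermediate table is the expensive part.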

I originally used a different method to generate temp_indices, but it was slow. To improve the speed I switched to list(map(lambda i: func(i), cust_indices)), which now results in a crash. Would using joblib to run on multiple cores solve this issue? — Sade, May 19 '21 at 08:13
When it crashed on AWS, it reported "Killed" at temp_indices, so you are correct. In addition, htop shows that only the CPU is being utilised, and it hits red at 100%. — Sade, May 19 '21 at 08:15
There might be a better way of doing this than list(map(lambda ... — Sade, May 19 '21 at 08:16
Maybe I should convert cust_indices from a list of arrays to another structure so that I am not limited to using list(map(lambda... — Sade, May 19 '21 at 08:19
Did you do the math on how many permutations there are if the sample is 100 items? You can not fit them into memory. — Klaus D., May 19 '21 at 08:19
It will not reach 100; the maximum it should hit is under 10. There are only 10 items a customer can buy, so there are at most 10 indices. — Sade, May 19 '21 at 08:21
I am currently trying out joblib so that I can run on multiple cpus. — Sade, May 19 '21 at 08:57
temp_indices = Parallel(n_jobs=2)(delayed(func)(data_) for data_ in cust_indices) seems to run with errors and seems to produce the same outputs. I have to test this on AWS to confirm whether it runs on multiple CPUs. — Sade, May 19 '21 at 09:08
Currently experiencing the following issue: joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGKILL(-9)} — Sade, May 19 '21 at 10:31
When stepping through func, the PC stalls at perm_ = pd.DataFrame(itertools.permutations(u_data)) — Sade, May 19 '21 at 12:29
According to https://stackoverflow.com/questions/104420/how-to-generate-all-permutations-of-a-list, @Boris Gorelik mentions: This and other recursive solutions have a potential hazard of eating up all the RAM if the permutated list is big enough. They also reach the recursion limit (and die) with large lists. This is what I am experiencing. — Sade, May 19 '21 at 12:42
After detecting the line of code at which the code crashes, I have updated the question above and its title. — Sade, May 19 '21 at 12:47
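
On the joblib error and the recursion quote above: itertools.permutations is a lazy iterator and does not recurse, so the memory is used when its whole output is collected at once, and running several workers in parallel means holding several such tables at the same time. A minimal sketch of bounding memory by working in batches follows. The names permutation_batches and batch_size are hypothetical, and what is done with each batch is left open.

```python
import itertools


def permutation_batches(items, batch_size):
    """Yield lists of at most batch_size permutations, one batch at a time."""
    perms = itertools.permutations(items)
    while True:
        batch = list(itertools.islice(perms, batch_size))
        if not batch:
            return
        yield batch


# Only one batch is alive at a time, so peak memory is set by batch_size.
sizes = [len(batch) for batch in permutation_batches(range(4), 10)]
print(sizes)  # [10, 10, 4]
```

This limits memory but not the total amount of work, so it only helps when the per-customer arrays stay small, such as the limit of 10 mentioned in the comments.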

