The code below creates a dictionary of permutations for each array in a list of arrays of customer ids/indices.

Test code can be run to generate a basic sample set:

import itertools

import numpy as np
import pandas as pd


def func(u_data):
    # One row per permutation of u_data; all n! rows are materialised at once.
    perm_ = pd.DataFrame(itertools.permutations(u_data))
    # Index by the last column, then convert the remaining columns to nested dicts.
    p_ = perm_.set_index(perm_.shape[1] - 1).to_dict()
    return p_


if __name__ == "__main__":
    cust_indices = [np.array([90, 91]), np.array([100, 101]), np.array([68, 69])]
    temp_indices = list(map(lambda i: func(i), cust_indices))

When the arrays in cust_indices grow in number of elements, the process is killed on AWS EC2 (i.e. it causes an OOM/memory error).

The code crashes at perm_ = pd.DataFrame(itertools.permutations(u_data)) when cust_indices = [np.array([90, 91]), np.array([100, 101]), np.array([68, 69]), np.array([1234372, 1234373, 1234374, 1234375, 1234376, 1234377, 1234378, 1234379, 1234380, 1234381, 1234382, 1234383, 1234384, 1234385])]
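For scale, the last array above has 14 elements, so that line asks for 14! = 87,178,291,200 rows. The sketch below is a rough lower bound on the size of such a table, assuming each id is stored as a 64-bit integer; the helper name permutation_table_bytes is made up for illustration, and real pandas usage is likely higher because the permutation tuples are collected before the frame is built.

```python
import math

BYTES_PER_VALUE = 8  # assumption: each id is stored as a 64-bit integer


def permutation_table_bytes(n_items):
    """Rough lower bound on memory for a dense table of all permutations."""
    n_rows = math.factorial(n_items)  # one row per permutation
    return n_rows * n_items * BYTES_PER_VALUE  # n_items values per row


if __name__ == "__main__":
    for n_items in (2, 10, 14):
        n_rows = math.factorial(n_items)
        gib = permutation_table_bytes(n_items) / 2**30
        print(f"{n_items:>2} ids: {n_rows:>14,} permutations, about {gib:,.2f} GiB")
```

Under that assumption, 10 ids need roughly 290 MB per array, while 14 ids need nearly 10 TB, which no single EC2 instance will hold in RAM.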

I am currently trying to optimize the code so that it handles a larger dataset without an OOM error, either by using multiprocessing or by changing the line perm_ = pd.DataFrame(itertools.permutations(u_data)).
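One possible direction for changing that line is to consume itertools.permutations lazily instead of passing it to pd.DataFrame. The following is a minimal sketch, not a tested replacement: func_streaming is a hypothetical name, and it assumes the goal is the same nested dictionary that set_index(...).to_dict() yields, where a later permutation overwrites an earlier one that ends in the same id.

```python
import itertools


def func_streaming(u_data):
    """Build {column: {last_id: value}} without storing every permutation."""
    n = len(u_data)
    # One inner dict per column except the last, which acts as the key.
    result = {col: {} for col in range(n - 1)}
    for perm in itertools.permutations(u_data):  # yields one tuple at a time
        key = perm[-1]
        for col in range(n - 1):
            # Later permutations overwrite earlier ones with the same key,
            # mirroring how duplicate index labels collapse in a dict.
            result[col][key] = perm[col]
    return result


if __name__ == "__main__":
    print(func_streaming([90, 91]))
```

This keeps memory roughly constant, but it still loops over all n! permutations, so a 14-element array would remain impractical in running time even though it would no longer be killed for memory.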

Sade
  • Your permutations are too many to store them in memory. – Klaus D. May 19 '21 at 08:09
  • I originally used a different method to generate temp_indices. However, that method was slow. To improve the speed I used list(map(lambda i: func(i), cust_indices)), which now results in a crash. Would using joblib to run on multiple cores solve this issue? – Sade May 19 '21 at 08:13
  • When it crashed on AWS, it states killed at temp_indices. So you are correct. In addition, when I use htop only CPU is being utilised and it hits red at 100%. – Sade May 19 '21 at 08:15
  • There might be a better way of doing this than list(map(lambda ... – Sade May 19 '21 at 08:16
  • Maybe I should convert cust_indices from list of array so that I am not limited to using list(map(lambda... – Sade May 19 '21 at 08:19
  • Did you do the math on how many permutations there are if the sample is 100 items? You can not fit them into memory. – Klaus D. May 19 '21 at 08:19
  • It will not reach 100; the maximum it should hit is under 10, since there are only 10 items a customer can buy and therefore at most 10 indices. – Sade May 19 '21 at 08:21
  • I am currently trying out joblib so that I can run on multiple cpus. – Sade May 19 '21 at 08:57
  • temp_indices = Parallel(n_jobs=2)(delayed(func)(data_) for data_ in cust_indices) seems to run with errors and seems to produce the same outputs. I have to test this on AWS to confirm whether it runs on multiple CPUs. – Sade May 19 '21 at 09:08
  • Currently experiencing the following issue: joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGKILL(-9)} – Sade May 19 '21 at 10:31
  • When stepping through func, the PC stalls at perm_ = pd.DataFrame(itertools.permutations(u_data)) – Sade May 19 '21 at 12:29
  • According to https://stackoverflow.com/questions/104420/how-to-generate-all-permutations-of-a-list, @Boris Gorelik mentions: This and other recursive solutions have a potential hazard of eating up all the RAM if the permutated list is big enough. They also reach the recursion limit (and die) with large lists. This is what I am experiencing. – Sade May 19 '21 at 12:42
  • After detecting the line of code at which the code crashes, I have updated the question above and its title. – Sade May 19 '21 at 12:47
