I am using `pool.map` from `multiprocessing` to run a custom function over groups of a data frame:

import pandas as pd
from multiprocessing import pool

def my_func(data):  # This is just a dummy function; f is the real per-row computation.
    data = data.assign(new_col=data.apply(lambda x: f(x), axis=1))
    return data

def main():
    mypool = pool.Pool(processes=16, maxtasksperchild=100)
    ret_list = mypool.map(my_func, (group for name, group in gpd))
    mypool.close()
    mypool.join()
    result = pd.concat(ret_list, axis=0)

Here `gpd` is a grouped data frame (the result of a `groupby`), so I am passing one group's data frame at a time to `pool.map`. I keep getting a memory error here.
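For context, a minimal sketch of what `gpd` is assumed to look like (the data frame and grouping key below are hypothetical, not from the original post): iterating over a GroupBy object yields `(name, group)` pairs, so the generator expression above hands each per-group data frame to the pool.

import pandas as pd

# Hypothetical example data; the real gpd is a groupby on a larger, multi-index frame.
df = pd.DataFrame({"key": ["a", "a", "b", "b"], "value": [1, 2, 3, 4]})
gpd = df.groupby("key")  # iterating over gpd yields (name, group) pairs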

(Screenshot of `top` output showing the pool's worker processes and their memory usage.)

As I can see from the `top` output, VIRT grows several-fold and leads to this error.

Two questions:

  1. How do I solve this growing memory issue (VIRT)? Maybe there is a way to play with the chunk size here (see the sketch after this list)?
  2. Although it launches as many Python subprocesses as I specified in `Pool(processes=...)`, not every CPU reaches 100%; it seems only one or two workers run at a time. Is that because the same chunk size is applied even though the data frames I pass vary a lot in size (some are small)? How do I keep every CPU busy?
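A minimal sketch of the chunk-size idea from question 1, reusing `my_func` and `gpd` from the snippet above. By default `Pool.map` bundles several input items into one task per worker; `chunksize=1` sends each group as its own task, which can balance load better when group sizes vary a lot (it does not by itself stop `map` from materializing all inputs and results in memory).

mypool = pool.Pool(processes=16, maxtasksperchild=100)
# chunksize=1: one group per task, better load balancing for uneven group sizes
ret_list = mypool.map(my_func, (group for name, group in gpd), chunksize=1)
mypool.close()
mypool.join()
result = pd.concat(ret_list, axis=0)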
ds_user
  • Do not use complex data structures when passing data between processes, as Python needs to pickle & unpickle all that data and re-create everything on both sides, which, with behemoth modules like `pandas`, can take a significant amount of resources. Also, do use iterators/generators when passing data so you don't have to map the whole data set into memory before you even start (see [this answer](https://stackoverflow.com/a/44502827/7553525) for an example) – zwer Jun 23 '17 at 08:31
  • Yeah, but it's a multi-index data frame; I need the keys, columns, etc. from it in my function. Also, the data frame size is not that big; it's the processing which takes much memory. – ds_user Jun 23 '17 at 08:51

1 Answer


Just for anyone looking for an answer in the future: I solved this by using `imap` instead of `map`. `map` materializes the whole input iterable as a list and holds every result in memory at once, which is intensive; `imap` consumes the iterable lazily and yields results as they complete.
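A minimal sketch of what the `imap`-based version might look like, assuming the `my_func` and `gpd` from the question (the `chunksize` value is just an example):

import pandas as pd
from multiprocessing import pool

mypool = pool.Pool(processes=16, maxtasksperchild=100)
# imap pulls groups from the generator lazily instead of building the whole
# input list first; results are yielded in order as they finish and are
# consumed directly by pd.concat.
ret_iter = mypool.imap(my_func, (group for name, group in gpd), chunksize=1)
result = pd.concat(ret_iter, axis=0)
mypool.close()
mypool.join()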

ds_user