I am using pool.map from multiprocessing to run my custom function:
from multiprocessing import pool
import pandas as pd

def my_func(data):  # This is just a dummy function.
    data = data.assign(new_col=data.apply(lambda x: f(x), axis=1))
    return data

def main():
    mypool = pool.Pool(processes=16, maxtasksperchild=100)
    ret_list = mypool.map(my_func, (group for name, group in gpd))
    mypool.close()
    mypool.join()
    result = pd.concat(ret_list, axis=0)
Here gpd is a grouped data frame, so I am passing one data frame at a time to pool.map (a sketch of how gpd is built is below). I keep getting a memory error.
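For context, gpd is built roughly like this; the file name and grouping column below are just placeholders, not my real data:

df = pd.read_csv('input.csv')   # placeholder for the real data source
gpd = df.groupby('some_key')    # 'some_key' stands in for my actual grouping column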
As I can see from here, VIRT increases multiple-fold and leads to this error.
Two questions:
- How do I solve this growing memory issue showing up in VIRT? Maybe there is a way to play with the chunk size here (see the sketch after this list for what I mean)?
- Second, although it launches as many Python subprocesses as I specified in Pool(processes=...), I can see that not every CPU hits 100%; it seems not all the processes are being used, and only one or two run at a time. Maybe that is because the same chunk size is applied to the differently sized data frames I pass each time (some data frames will be small)? How do I utilise every CPU here?
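To make the chunk-size idea concrete, this is roughly what I have been considering trying; using imap_unordered with chunksize=1 is my own guess, not something I have confirmed fixes the memory growth or the CPU utilisation:

import multiprocessing as mp
import pandas as pd

def main():
    groups = (group for name, group in gpd)
    with mp.Pool(processes=16, maxtasksperchild=100) as mypool:
        # chunksize=1 hands one data frame at a time to whichever worker is free,
        # so a worker that finishes a small group can pick up the next group
        # instead of waiting for a large pre-assigned chunk to complete.
        ret_list = list(mypool.imap_unordered(my_func, groups, chunksize=1))
    result = pd.concat(ret_list, axis=0)
    return result

With imap_unordered the results come back in whatever order the workers finish; if the order of the concatenated frame matters, plain imap with chunksize=1 would preserve the submission order instead.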