I am using `pool.map` from `multiprocessing` to run a custom function over groups of a data frame:

import pandas as pd
from multiprocessing import pool

def my_func(data):  # This is just a dummy function; f is the real per-row computation.
    data = data.assign(new_col=data.apply(lambda x: f(x), axis=1))
    return data

def main():
    mypool = pool.Pool(processes=16, maxtasksperchild=100)
    ret_list = mypool.map(my_func, (group for name, group in gpd))
    mypool.close()
    mypool.join()
    result = pd.concat(ret_list, axis=0)

Here `gpd` is a grouped data frame (the result of a `groupby`), so I am passing one group's data frame at a time to `pool.map`. I keep getting a memory error here.
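For context, a minimal sketch of what `gpd` is assumed to look like (the data frame and grouping key below are hypothetical, not from the original post): iterating over a GroupBy object yields `(name, group)` pairs, so the generator expression above hands each per-group data frame to the pool.

import pandas as pd

# Hypothetical example data; the real gpd is a groupby on a larger, multi-index frame.
df = pd.DataFrame({"key": ["a", "a", "b", "b"], "value": [1, 2, 3, 4]})
gpd = df.groupby("key")  # iterating over gpd yields (name, group) pairs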

(Screenshot of `top` output showing the pool's worker processes and their memory usage.)

As I can see from the `top` output, VIRT grows several-fold and leads to this error.

Two questions:

  1. How do I solve this growing memory issue (VIRT)? Maybe there is a way to play with the chunk size here (see the sketch after this list)?
  2. Although it launches as many Python subprocesses as I specified in `Pool(processes=...)`, not every CPU reaches 100%; it seems only one or two workers run at a time. Is that because the same chunk size is applied even though the data frames I pass vary a lot in size (some are small)? How do I keep every CPU busy?
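A minimal sketch of the chunk-size idea from question 1, reusing `my_func` and `gpd` from the snippet above. By default `Pool.map` bundles several input items into one task per worker; `chunksize=1` sends each group as its own task, which can balance load better when group sizes vary a lot (it does not by itself stop `map` from materializing all inputs and results in memory).

mypool = pool.Pool(processes=16, maxtasksperchild=100)
# chunksize=1: one group per task, better load balancing for uneven group sizes
ret_list = mypool.map(my_func, (group for name, group in gpd), chunksize=1)
mypool.close()
mypool.join()
result = pd.concat(ret_list, axis=0)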
ds_user
  • Do not use complex data structures when passing data between processes, as Python needs to pickle & unpickle all that data and re-create everything on both sides, which, with behemoth modules like `pandas`, can take a significant amount of resources. Also, do use iterators/generators when passing data so you don't have to map the whole data set into memory before you even start (see [this answer](https://stackoverflow.com/a/44502827/7553525) for an example) – zwer Jun 23 '17 at 08:31
  • Yeah, but it's a multi-index data frame; I need the keys, columns, etc. from it in my function. Also, the data frame size is not that big; it's the processing which takes much memory. – ds_user Jun 23 '17 at 08:51

1 Answer


Just for anyone looking for an answer in the future: I solved this by using `imap` instead of `map`. `map` materializes the whole input iterable as a list and holds every result in memory at once, which is intensive; `imap` consumes the iterable lazily and yields results as they complete.
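A minimal sketch of what the `imap`-based version might look like, assuming the `my_func` and `gpd` from the question (the `chunksize` value is just an example):

import pandas as pd
from multiprocessing import pool

mypool = pool.Pool(processes=16, maxtasksperchild=100)
# imap pulls groups from the generator lazily instead of building the whole
# input list first; results are yielded in order as they finish and are
# consumed directly by pd.concat.
ret_iter = mypool.imap(my_func, (group for name, group in gpd), chunksize=1)
result = pd.concat(ret_iter, axis=0)
mypool.close()
mypool.join()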

ds_user