I have the following code. I want to calculate the values of all pairs by applying the calculate_mi function to the global dataframe df, using the Python multiprocess package.

from multiprocess import Pool

def calculate_mi(pair):
  global df
  from pyitlib import discrete_random_variable as drv
  import numpy as np
  i, j = pair
  val = ( 2*drv.information_mutual(df[i].values.astype(np.int32), df[j].values.astype(np.int32)) ) / ( drv.entropy(df[i].values.astype(np.int32)) + drv.entropy(df[j].values.astype(np.int32)) )
  return (i,j), val

def calculate_value(t_df):
  global df
  df = t_df
  all_pair = [('1', '2'), ('1', '3'), ('2', '1'), ('2', '3'), ('3', '1'), ('3', '2')]

  pool = Pool()
  pair_value_list = pool.map(calculate_mi, all_pair)
  pool.close()
  print(pair_value_list)

def calc():
  data = {'1':[1, 0, 1, 1],
    '2':[0, 1, 1, 0],
    '3':[1, 1, 0, 1],
    '0':[0, 1, 0, 1] }

  t_df = pd.DataFrame(data)
  calculate_value(t_df)

if __name__ == '__main__':
  calc()

This code gives me the expected output on the Google Colab platform, but it raises the following error when I run it on my local machine (Windows 10, Anaconda, Jupyter Notebook, Python 3.6.9). How can I solve this, or is there another way to do it?

RemoteTraceback Traceback (most recent call last) ... NameError: name 'df' is not defined

1 Answer


First, a couple of things:

  1. It should be: from multiprocessing import Pool (not from multiprocess)
  2. It appears you have left out the import of the pandas library.

Moving on ...

The problem is that under Windows new processes are not created with a fork call, and consequently the sub-processes do not automatically inherit global variables such as df. Therefore, you must arrange for each sub-process to have the global variable df set, by passing an initializer when you create the Pool:

from multiprocessing import Pool
import pandas as pd

def calculate_mi(pair):
  global df
  from pyitlib import discrete_random_variable as drv
  import numpy as np
  i, j = pair
  val = ( 2*drv.information_mutual(df[i].values.astype(np.int32), df[j].values.astype(np.int32)) ) / ( drv.entropy(df[i].values.astype(np.int32)) + drv.entropy(df[j].values.astype(np.int32)) )
  return (i,j), val

# initialize global variable df for each sub-process
def initpool(t_df):
    global df
    df = t_df

def calculate_value(t_df):
  all_pair = [('1', '2'), ('1', '3'), ('2', '1'), ('2', '3'), ('3', '1'), ('3', '2')]

  # make sure each sub-process has global variable df properly initialized:    
  pool = Pool(initializer=initpool, initargs=(t_df,))
  pair_value_list = pool.map(calculate_mi, all_pair)
  pool.close()
  print(pair_value_list)

def calc():
  data = {'1':[1, 0, 1, 1],
    '2':[0, 1, 1, 0],
    '3':[1, 1, 0, 1],
    '0':[0, 1, 0, 1] }

  t_df = pd.DataFrame(data)
  calculate_value(t_df)

if __name__ == '__main__':
  calc()
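
A minimal variant, assuming the same initpool and calculate_mi as above, that lets the Pool context manager handle shutting the pool down (as also recommended in the comments below):

def calculate_value(t_df):
  all_pair = [('1', '2'), ('1', '3'), ('2', '1'), ('2', '3'), ('3', '1'), ('3', '2')]

  # the with-block calls pool.terminate() on exit; map() has already blocked
  # until every result was returned, so terminating at that point is safe
  with Pool(initializer=initpool, initargs=(t_df,)) as pool:
    pair_value_list = pool.map(calculate_mi, all_pair)
  print(pair_value_list)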
Booboo
  • As far as I know, the IPython/Jupyter notebook still does not support `multiprocessing`, but it does support the `multiprocess` module, which is a fork of [multiprocessing](https://pypi.org/project/multiprocess/); that's why I used it, as I am working in a Jupyter notebook. This piece of code works fine, but if I use the `numpy` and `drv` modules outside of the `calculate_mi` function it gives the error "name 'drv' is not defined". Is there any way to import these once, outside the `calculate_mi` function, and have them usable by all child processes? [N.B. I had placed them in the `initpool` method; it still shows the same error] – Hasan Tarek Dec 16 '20 at 11:06
  • First, there is a [way to use `multiprocessing` in jupyter notebook](https://medium.com/@grvsinghal/speed-up-your-python-code-using-multiprocessing-on-windows-and-jupyter-or-ipython-2714b49d6fac). I've never had a problem with importing `numpy` globally using `multiprocessing`. If you are saying that is an issue with `multiprocess`, then switch to `multiprocessing` according to the link I've shown you or stop using jupyter notebook. – Booboo Dec 16 '20 at 11:12
  • And as an aside: If you don't use a context manager, i.e. `with Pool(...) as pool:`, which terminates the pool correctly, then after you call `pool.close()`, you should really call `pool.join()`. See: https://stackoverflow.com/questions/38271547/when-should-we-call-multiprocessing-pool-join – Booboo Dec 16 '20 at 11:35
  • What is the difference between the `multiprocess` and `multiprocessing` modules, and are there any disadvantages to `multiprocess`? Why do you suggest using `multiprocessing` instead of `multiprocess`? – Hasan Tarek Dec 16 '20 at 11:56
  • I am not familiar with `multiprocess`, and when I went to the PyPi repository to look it up I could not learn very much from the description. So I suggested what is the *standard* and what I know *should* work. – Booboo Dec 16 '20 at 11:58
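
Following up on the comments above about module-level imports and using multiprocessing from a Jupyter notebook on Windows: a minimal sketch, assuming the worker code is moved into a separate .py file (mi_workers.py is a hypothetical name). Under spawn, each child process imports that file when it unpickles the worker and initializer functions, so the numpy and drv imports at the top of the file run once per worker and are visible to calculate_mi without importing inside it:

# mi_workers.py  (hypothetical module name)
import numpy as np
from pyitlib import discrete_random_variable as drv

def initpool(t_df):
  # runs once in each worker process and publishes the DataFrame as a module-level global
  global df
  df = t_df

def calculate_mi(pair):
  i, j = pair
  x = df[i].values.astype(np.int32)
  y = df[j].values.astype(np.int32)
  # normalized mutual information: 2*I(x;y) / (H(x) + H(y))
  return (i, j), 2 * drv.information_mutual(x, y) / (drv.entropy(x) + drv.entropy(y))

The notebook (or a main script) then only needs to import those functions and build the pool as shown in the answer:

# notebook cell / main script
from multiprocessing import Pool
import pandas as pd
from mi_workers import initpool, calculate_mi

def calculate_value(t_df):
  all_pair = [('1', '2'), ('1', '3'), ('2', '1'), ('2', '3'), ('3', '1'), ('3', '2')]
  with Pool(initializer=initpool, initargs=(t_df,)) as pool:
    print(pool.map(calculate_mi, all_pair))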