
I created a global pandas DataFrame and expected the child processes to be able to access it, but it seems the child processes cannot get the global variable.

import numpy as np
import pandas as pd
import multiprocessing as mp
from psutil import cpu_count  # cpu_count(logical=True) below matches psutil's API

data = pd.DataFrame(data = np.array([[i for i in range(1000)] for j in range(500)]))

def get_sample(i):
    print("start round {}".format(i))
    sample = data.sample(500, random_state=i)
    xs = sample.sum(axis=0)
    if i < 10:
        print(data.shape())
        print(sample.iloc[:3, :3])
    print("round {} returns output".format(i))
    return xs

samples = []
def collect(result):
    print("collect called with {}".format(result[0][0].shape))
    global samples
    samples.extend(result)

ntasks = 1000
if __name__=='__main__':
    samples = []
    xs = pd.DataFrame()
    """sampling"""
    pool = mp.Pool(cpu_count(logical=True))
    print("start sampling, total round = {}".format(ntasks))
    r = pool.map_async(get_sample, [j for j in range(ntasks)], callback=collect)
    r.wait()
    pool.close()
    pool.join()

    xs = pd.concat([sample for sample in samples], axis = 1, ignore_index=True)
    xs = xs.transpose()

    print("xs: ")
    print(xs.shape)
    print(xs.iloc[:10, :10])

The global DataFrame is data. I expected that in each child process the function get_sample could access data and retrieve some values from it. To make sure the child processes can see data, I print out the shape of data in each child process. The problem is that the child processes apparently cannot get data, because when I run the code there is no printout of data's shape, nor of the partial sample.

Furthermore, I received this error:

Traceback (most recent call last):
  File "sampling2c.py", line 51, in <module>
    xs = pd.concat([sample for sample in samples], axis = 1, ignore_index=True)
  File "/usr/usc/python/3.6.0/lib/python3.6/site-packages/pandas/tools/merge.py", line 1451, in concat
    copy=copy)
  File "/usr/usc/python/3.6.0/lib/python3.6/site-packages/pandas/tools/merge.py", line 1484, in __init__
    raise ValueError('No objects to concatenate')

It seems the get_sample function didn't return anything, so the sampling failed.

However, when I ran an experiment to test whether child processes can access a global variable, it worked.

import pandas as pd
from multiprocessing import Pool, Manager, current_process

df = pd.DataFrame(data = {'a':[1,2,3], 'b':[2,4,6]})
df['c1'] = [1,2,1]
df['c2'] = [2,1,2]
df['c3'] = [3,4,4]

df2 = pd.DataFrame(data = {'a':[i for i in range(100)], 'b':[i for i in range(100, 200)]})
l = [1, 2, 3]
Mgr = Manager()
results = []
def collect(result):
    global results
    #print("collect called with {}".format(result))
    results.extend(result)

counter = 12
def sample(i):
    print(current_process())
    return df2.sample(5, random_state = i)

if __name__=='__main__':
    pool = Pool(3)
    r = pool.map_async(sample, [i for i in range(3)], callback = collect) #callback = collect
    r.wait()
for res in results:
    print(res)

Each child process can access the global variable df2. I'm not sure why the child processes cannot access data in the first block of code.

  • Possible duplicate of [multiprocessing global variable updates not returned to parent](https://stackoverflow.com/questions/11055303/multiprocessing-global-variable-updates-not-returned-to-parent) –  Aug 04 '19 at 04:46
  • Multiprocessing spawns new processes, each with its own copy of the globals, according to existing answers: https://stackoverflow.com/questions/659865/multiprocessing-sharing-a-large-read-only-object-between-processes –  Aug 04 '19 at 04:49
  • The child process just has a virtual copy of every variable. Changes made by either the parent or the child will not be seen by the other. – Skaperen Aug 04 '19 at 06:35

1 Answer


When you spawn a process using multiprocessing, your new process gets a copy of the state at the time of spawning.
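A minimal sketch of this copy semantics (the names here are illustrative, not from the question's code): a global mutated inside a child process is invisible to the parent.

```python
import multiprocessing as mp

counter = 0  # global in the parent process

def bump():
    # Each child works on its own copy of the module globals,
    # so this increment is only visible inside the child.
    global counter
    counter += 1
    print("child sees counter =", counter)

if __name__ == "__main__":
    p = mp.Process(target=bump)
    p.start()
    p.join()
    print("parent still sees counter =", counter)  # prints 0
```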

If you want to communicate data between the parent process and its children (or between sibling processes), you can do so using shared variables or a server process that manages shared objects. For details, see [sharing state between processes](https://docs.python.org/3/library/multiprocessing.html#sharing-state-between-processes) in the multiprocessing documentation.

If you instead use threading, the individual threads all run in the same context and share all global variables, so you can access every global from any thread and from the main loop without doing anything special.
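A small sketch of the threading variant: all threads append to the same global list, with no proxies or shared-memory machinery needed.

```python
import threading

results = []  # ordinary global, shared by every thread

def worker(i):
    # Threads run in the same process, so this is the parent's list.
    results.append(i * 10)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # [0, 10, 20]
```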

Both threading and multiprocessing have their advantages and disadvantages, but this is not the place to discuss them.

  • Thanks for your response. I have tried the shared-variables method: I used Manager.Namespace to share the DataFrame. However, shared variables have a size limit. I need to share a DataFrame that is over 3 gigabytes, and when I put it into the shared variable I get the error "struct.error: 'i' format requires -2147483648 <= number <= 2147483647". – Henry Bai Aug 05 '19 at 01:38
  • Regarding threading: that makes sense, but the number of threads per core is limited, so it cannot speed things up effectively. – Henry Bai Aug 05 '19 at 01:46
  • Could you tell me more about how to set up a server process please? Which library and methods to use to create a server process? – Henry Bai Aug 05 '19 at 01:47
  • @HenryBai I never used it myself (I normally use queues, because I work with sockets and serial ports), but the multiprocessing link I gave you contains an example for a server process (just below the shared memory entry). – NZD Aug 06 '19 at 02:31