
How can I parallelize a computation and store the results in a common dataframe? My function mutates a global dataframe df by inserting values.

import multiprocessing as mp
import numpy as np
import pandas as pd


def f(u): df.loc[u] = u**2

# single-core computation:
df = pd.DataFrame(np.zeros(10), index=range(10))
[f(u) for u in range(10)]
print(df.T)
# gives correct result
#      0    1    2    3     4     5     6     7     8     9
# 0  0.0  1.0  4.0  9.0  16.0  25.0  36.0  49.0  64.0  81.0

# multicore computation:
df = pd.DataFrame(np.zeros(10), index=range(10))
pool = mp.Pool(2)
pool.map(f, range(10))
pool.close()
print(df.T)
# gives wrong result
#      0    1    2    3    4    5    6    7    8    9
# 0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0

It is easy to achieve the same task in Julia:

using DataFrames

function f(u) df[u,:v]=u^2 end

df = DataFrame(v=zeros(10));
Threads.@threads for u=1:10 f(u) end

I was hoping a simple solution like this would also be possible in Python.
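For reference, the closest Python analogue to the Julia snippet would use threads, since threads (unlike processes) share memory. A minimal sketch follows; note that the GIL prevents CPU-bound work from actually running in parallel, and concurrent writes to a DataFrame are not guaranteed to be thread-safe, so this only illustrates the memory model:

```python
import threading
import numpy as np
import pandas as pd

def f(u):
    # Threads share the parent's memory, so this write is visible globally.
    df.loc[u] = u ** 2

df = pd.DataFrame(np.zeros(10), index=range(10))
threads = [threading.Thread(target=f, args=(u,)) for u in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(df.T)
```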

Leo
    Memory is not shared across sub-processes. You could probably do this with multithreading but a) you might need to protect/lock the dataframe and b) it would probably defeat the object which is presumably to spread the computational load which multithreading isn't good for – DarkKnight May 15 '23 at 15:43
  • Hmm, in Julia, this works fine (no locking necessary). Even in Python, my actual case has a large input dataframe DF (global variable). In each parallel call, my function f queries the DF, processes the subframe, and returns the result (this already works). I just wish to store the result into my global df, not return it. Do you have a suggestion how to do this? – Leo May 15 '23 at 15:46
  • You may find this useful: https://stackoverflow.com/questions/59844783/sharing-python-objects-e-g-pandas-dataframe-between-independently-running-pyt – DarkKnight May 15 '23 at 15:52
  • Is this a single numeric data type? A numpy array in shared memory may work best for you. – tdelaney May 15 '23 at 19:46

1 Answer


With the default fork start method on a copy-on-write operating system, each pool worker gets its own copy of the parent's memory, so the assignment df.loc[u] = u**2 writes to a copy of df that only the child process sees. Instead, do the bulk of the work in the pool worker and pass the result back to the parent for assignment:

import multiprocessing as mp
import numpy as np
import pandas as pd

def f(u): 
    return u, u**2

# multicore computation:
df = pd.DataFrame(np.zeros(10), index=range(10))
with mp.Pool(2) as pool:
    for u, result in pool.imap(f, range(10)):
        df.loc[u] = result
print(df.T)

This adds the cost of sending more return data back to the parent process but is reasonable if the amount of work done in the worker is large by comparison.

tdelaney
  • Thank you, this is what I was looking for. Also, I'd appreciate any upvotes, so I can gather 15 points, to be able to upvote posts. – Leo May 16 '23 at 06:25