I am accessing a very large Pandas dataframe as a global variable. This variable is accessed in parallel via joblib.
Eg.
df = db.query("select id, a_lot_of_data from table")
def process(id):
temp_df = df.loc[id]
temp_df.apply(another_function)
Parallel(n_jobs=8)(delayed(process)(id) for id in df['id'].to_list())
Accessing the original df in this manner seems to copy the data across processes. This is unexpected since the original df isnt being altered in any of the subprocesses? (or is it?)