Parallelise operations on pandas dataframes with multiprocessing - what am I doing wrong?

Question

I am trying to find a way to parallelise certain operations on dataframes, especially those that cannot be vectorised. I have tested the code below, taken from http://www.racketracer.com/2016/07/06/pandas-in-parallel/ , but it doesn't work. No error message - quite simply, nothing happens. Debugging it, it seems the code gets stuck at df = pd.concat(pool.map(func, df_split)) , but without any error messages.

What am I doing wrong?

import timeit
import pandas as pd
import numpy as np
import seaborn as sns
import multiprocessing
from multiprocessing import Pool

def parallelize_dataframe(df, func):
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df

def multiply_columns(data):
    data['length_of_word'] = data['species'].apply(lambda x: len(x))
    return data

num_partitions = 2 #number of partitions to split dataframe
num_cores = 2# multiprocessing.cpu_count() #number of cores on your machine

iris = pd.DataFrame(sns.load_dataset('iris'))
iris = parallelize_dataframe(iris, multiply_columns)

Is there a reason why you are not using e.g. `dask`? – Quickbeam2k1 Mar 20 '19 at 10:58 — Quickbeam2k1, Mar 20 '19 at 10:58

score 0 · Answer 1 · answered Mar 20 '19 at 10:43

0

I needed to add

if __name__ == "__main__":

answered Mar 20 '19 at 10:43

Pythonista anonymous

8,140
20
70
112

Please use the edit link on your question to add additional information. The Post Answer button should be used only for complete answers to the question. - [From Review](/review/low-quality-posts/22519096) – Amsakanna Mar 20 '19 at 11:02
I am not following. The complete answer to the question is that parallelize(dataframe) must be run only if __name=="__main__" , which is what I have written. I could have made it more explicit, but it seemed pretty obvious to me – Pythonista anonymous Mar 20 '19 at 11:04
1

Please include the lines before and after this `if`-statement you want to include. _(Always think of your posts here as entries in a knowledge base, not just a chat)_ – Dirk Horsten Mar 20 '19 at 11:10
This looked more like a comment saying that you forgot to add something to your question. Anyways cheers! – Amsakanna Mar 20 '19 at 11:22

Parallelise operations on pandas dataframes with multiprocessing - what am I doing wrong?

1 Answers1