If you are looking to work with two large DataFrames, then you could try wrapping this merge in dask.delayed
Imports
from faker import Faker
import pandas as pd
import dask
from dask.diagnostics import ProgressBar
import random
fake = Faker()
Generate dummy data in order to get a large number of rows in each DataFrame
- Specify the number of rows of dummy data to generate in each DataFrame
number_of_rows_in_df = 3000
number_of_rows_in_pa = 8000
Generate a large dataset using the faker library (per this SO post)
def create_rows(auth_colname, num=1):
    # Build `num` rows, each with a fake author name and a random PaperID
    output = [{auth_colname: fake.name(),
               "PaperID": random.randint(1000, 2000)} for x in range(num)]
    return output
df = pd.DataFrame(create_rows("Main_Author", number_of_rows_in_df))
pa = pd.DataFrame(create_rows("Co_Author", number_of_rows_in_pa))
Print first 5 rows of dataframes
print(df.head())
Main_Author PaperID
0 Kyle Morton MD 1522
1 April Edwards 1992
2 Rachel Sullivan 1874
3 Kevin Johnson 1909
4 Julie Morton 1635
print(pa.head())
Co_Author PaperID
0 Deborah Cuevas 1911
1 Melissa Fox 1095
2 Sean Mcguire 1620
3 Cory Clarke 1424
4 David White 1569
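Since PaperID values repeat in both frames, the merge is many-to-many, so the joined result can be much larger than either input (on average about 3000 × 8000 / 1001 ≈ 24,000 rows here). A rough, standard-library-only sketch for estimating the inner-join row count before merging; the two lists are stand-ins for the PaperID columns, not the actual data generated above:

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is repeatable

# Stand-ins for df["PaperID"] and pa["PaperID"] (same sizes and ID range as above)
left_ids = [random.randint(1000, 2000) for _ in range(3000)]
right_ids = [random.randint(1000, 2000) for _ in range(8000)]

left_counts = Counter(left_ids)
right_counts = Counter(right_ids)

# An inner join emits one row per matching (left, right) pair for each key,
# so the total is the sum of count products over the keys.
merged_rows = sum(n * right_counts[key] for key, n in left_counts.items())
print(merged_rows)
```

If this estimate is far beyond what fits in memory, wrapping the merge in a single delayed call is unlikely to help on its own.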
Wrap the merge operation in a helper function
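def merge_operations(df1, df2):
    # Inner join on PaperID, then collect each main author's co-authors into a tuple
    df = df1.merge(df2, on="PaperID")
    df = df.groupby('Main_Author')['Co_Author'].apply(lambda x: tuple(x)).reset_index()
    # Count the co-authors collected for each main author
    df['Num_Co_Authors'] = df['Co_Author'].apply(lambda x: len(x))
    return df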
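To make the helper's behaviour easier to see, here is a small sketch that runs it on two hand-made frames. The helper is repeated so the snippet runs on its own, and the author names (Ann, Bob, Cy, Di, Ed) are made up for illustration:

```python
import pandas as pd

def merge_operations(df1, df2):
    # Same logic as the helper above: join on PaperID, then group co-authors
    df = df1.merge(df2, on="PaperID")
    df = df.groupby("Main_Author")["Co_Author"].apply(lambda x: tuple(x)).reset_index()
    df["Num_Co_Authors"] = df["Co_Author"].apply(lambda x: len(x))
    return df

# Hypothetical toy data: paper 1 has two co-authors, paper 2 has one
main = pd.DataFrame({"Main_Author": ["Ann", "Bob"], "PaperID": [1, 2]})
co = pd.DataFrame({"Co_Author": ["Cy", "Di", "Ed"], "PaperID": [1, 1, 2]})

result = merge_operations(main, co)
print(result)
```

Ann ends up with the tuple ("Cy", "Di") and a count of 2, and Bob with ("Ed",) and a count of 1.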
Dask approach - Generate the final DataFrame using dask.delayed
ddf = dask.delayed(merge_operations)(df, pa)
with ProgressBar():
    # dask.compute returns a tuple, so the resulting DataFrame is df_dask[0]
    df_dask = dask.compute(ddf)
Output of Dask approach
[ ] | 0% Completed | 0.0s
[ ] | 0% Completed | 0.1s
[ ] | 0% Completed | 0.2s
[ ] | 0% Completed | 0.3s
[ ] | 0% Completed | 0.4s
[ ] | 0% Completed | 0.5s
[########################################] | 100% Completed | 0.6s
print(df_dask[0].head())
Main_Author Co_Author Num_Co_Authors
0 Aaron Anderson (Elizabeth Peterson, Harry Gregory, Catherine ... 15
1 Aaron Barron (Martha Neal, James Walton, Amanda Wright, Sus... 11
2 Aaron Bond (Theresa Lawson, John Riley, Daniel Moore, Mrs... 6
3 Aaron Campbell (Jim Martin, Nicholas Stanley, Douglas Berry, ... 11
4 Aaron Castillo (Kevin Young, Patricia Gallegos, Tricia May, M... 6
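Note that dask.compute returns a tuple with one entry per argument, which is why the result is read from df_dask[0]. A minimal sketch with a toy function (add is a hypothetical stand-in for merge_operations) showing this next to the .compute() method, which returns the value directly:

```python
import dask

def add(a, b):
    # Stand-in for an expensive function such as merge_operations
    return a + b

lazy = dask.delayed(add)(1, 2)   # nothing is executed yet

as_tuple = dask.compute(lazy)    # tuple with one element per argument
direct = lazy.compute()          # the bare result

print(as_tuple[0], direct)
```

Either form should work here; calling .compute() on the delayed object just avoids the extra indexing step.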
Pandas approach - Generate the final DataFrame using Pandas only
df_pandas = merge_operations(df, pa)
print(df_pandas.head())
Main_Author Co_Author Num_Co_Authors
0 Aaron Anderson (Elizabeth Peterson, Harry Gregory, Catherine ... 15
1 Aaron Barron (Martha Neal, James Walton, Amanda Wright, Sus... 11
2 Aaron Bond (Theresa Lawson, John Riley, Daniel Moore, Mrs... 6
3 Aaron Campbell (Jim Martin, Nicholas Stanley, Douglas Berry, ... 11
4 Aaron Castillo (Kevin Young, Patricia Gallegos, Tricia May, M... 6
Compare DataFrames obtained using Pandas and Dask approaches
# pandas.util.testing is deprecated; pandas.testing is the public location
from pandas.testing import assert_frame_equal
try:
    assert_frame_equal(df_dask[0], df_pandas, check_dtype=True)
except AssertionError as e:
    message = "\n"+str(e)
else:
    message = 'DataFrames created using Dask and Pandas are equivalent.'
Result of comparing two approaches
print(message)
DataFrames created using Dask and Pandas are equivalent.
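For reference, here is a small sketch of how the same check behaves when the frames do not match. The frames and the frames_match helper are hypothetical and only there for illustration:

```python
import pandas as pd
from pandas.testing import assert_frame_equal

left = pd.DataFrame({"PaperID": [1, 2], "Num_Co_Authors": [3, 4]})
same = left.copy()
different = pd.DataFrame({"PaperID": [1, 2], "Num_Co_Authors": [3, 5]})

def frames_match(a, b):
    """Return True if assert_frame_equal passes, False if it raises."""
    try:
        assert_frame_equal(a, b, check_dtype=True)
    except AssertionError:
        return False
    return True

print(frames_match(left, same))
print(frames_match(left, different))
```

The first comparison passes and the second fails, because one value in Num_Co_Authors differs.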