Looking for an elegant solution that avoids merging two dataframes

Question

I have a dask dataframe df that looks as follows:

Main_Author PaperID
A           X
B           Y
C           Z

I also have another dask dataframe pa that looks as follows:

PaperID  Co_Author
X        D
X        E
X        F
Y        A
Z        B
Z        D

I want a resulting dataframe that looks as follows:

Main_Author  Co_Authors   Num_Co_Authors
A            (D,E,F)      3
B            (A)          1
C            (B,D)        2

This is what I did:

df = df.merge(pa, on="PaperID")

df = df.groupby('Main_Author')['Co_Author'].apply(lambda x: tuple(x)).reset_index()

df['Num_Co_Authors'] = df['Co_Author'].apply(lambda x: len(x))

This works on small dataframes. However, on the really large ones I am working with, the process keeps getting killed. I believe the merge is the cause. Is there a more elegant way of getting the desired result?

score 2 · Accepted Answer · answered Mar 08 '19 at 21:26

If you need to work with two large DataFrames, you could try wrapping the merge in dask.delayed.

The Dask docs include a good example of dask.delayed, and there is another on SO.
The Dask docs also list common Dask use cases.

Imports

from faker import Faker
import pandas as pd
import dask
from dask.diagnostics import ProgressBar
import random
fake = Faker()

Generate dummy data in order to get a large number of rows in each DataFrame

Specify number of rows of dummy data to generate in each DataFrame

number_of_rows_in_df = 3000
number_of_rows_in_pa = 8000

Generate some big dataset using the faker library (per this SO post)

def create_rows(auth_colname, num=1):
    output = [{auth_colname:fake.name(),
               "PaperID":random.randint(1000,2000)} for x in range(num)]
    return output
df = pd.DataFrame(create_rows("Main_Author", number_of_rows_in_df))
pa = pd.DataFrame(create_rows("Co_Author", number_of_rows_in_pa))

Print first 5 rows of dataframes

print(df.head())
       Main_Author  PaperID
0   Kyle Morton MD     1522
1    April Edwards     1992
2  Rachel Sullivan     1874
3    Kevin Johnson     1909
4     Julie Morton     1635

print(pa.head())
        Co_Author  PaperID
0  Deborah Cuevas     1911
1     Melissa Fox     1095
2    Sean Mcguire     1620
3     Cory Clarke     1424
4     David White     1569

Wrap the merge operation in a helper function

def merge_operations(df1, df2):
    df = df1.merge(df2, on="PaperID")
    df = df.groupby('Main_Author')['Co_Author'].apply(lambda x: tuple(x)).reset_index()
    df['Num_Co_Authors'] = df['Co_Author'].apply(lambda x: len(x))
    return df
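
As a quick sanity check, the helper can be run on the three-row example from the question before trying it on the generated data. This is only a sketch in plain pandas: the names df_small and pa_small are made up for illustration, and the helper is repeated so the snippet runs on its own.

```python
import pandas as pd

# Toy frames copied from the question (plain pandas, not dask).
# df_small and pa_small are illustrative names, not from the original post.
df_small = pd.DataFrame({"Main_Author": ["A", "B", "C"],
                         "PaperID": ["X", "Y", "Z"]})
pa_small = pd.DataFrame({"PaperID": ["X", "X", "X", "Y", "Z", "Z"],
                         "Co_Author": ["D", "E", "F", "A", "B", "D"]})

def merge_operations(df1, df2):
    # Join authors to co-authors on the shared paper id
    merged = df1.merge(df2, on="PaperID")
    # Collect each main author's co-authors into one tuple
    grouped = (merged.groupby("Main_Author")["Co_Author"]
                     .apply(lambda x: tuple(x))
                     .reset_index())
    # Count the co-authors per main author
    grouped["Num_Co_Authors"] = grouped["Co_Author"].apply(len)
    return grouped

result = merge_operations(df_small, pa_small)
print(result)
```

The counts for A, B and C should come out as 3, 1 and 2, matching the desired result in the question.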

Dask approach - Generate final DataFrame using dask.delayed

ddf = dask.delayed(merge_operations)(df, pa)
with ProgressBar():
    df_dask = dask.compute(ddf)
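
Note that dask.compute returns a tuple with one entry per argument, which is why the result is read back as df_dask[0] further down. A minimal sketch of the difference, using a hypothetical add function as a stand-in for merge_operations:

```python
import dask

def add(a, b):
    # Hypothetical stand-in for merge_operations, kept trivial on purpose
    return a + b

lazy = dask.delayed(add)(1, 2)   # nothing is computed yet

as_tuple = dask.compute(lazy)    # one-element tuple holding the result
direct = lazy.compute()          # the result itself

print(as_tuple, direct)
```

Calling .compute() on the delayed object is an equivalent way to get the DataFrame directly, without indexing into a tuple.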

Output of Dask approach

[                                        ] | 0% Completed |  0.0s
[                                        ] | 0% Completed |  0.1s
[                                        ] | 0% Completed |  0.2s
[                                        ] | 0% Completed |  0.3s
[                                        ] | 0% Completed |  0.4s
[                                        ] | 0% Completed |  0.5s
[########################################] | 100% Completed |  0.6s

print(df_dask[0].head())
      Main_Author                                          Co_Author  Num_Co_Authors
0  Aaron Anderson  (Elizabeth Peterson, Harry Gregory, Catherine ...              15
1    Aaron Barron  (Martha Neal, James Walton, Amanda Wright, Sus...              11
2      Aaron Bond  (Theresa Lawson, John Riley, Daniel Moore, Mrs...               6
3  Aaron Campbell  (Jim Martin, Nicholas Stanley, Douglas Berry, ...              11
4  Aaron Castillo  (Kevin Young, Patricia Gallegos, Tricia May, M...               6

Pandas approach - Generate final DataFrame created using Pandas

df_pandas = merge_operations(df, pa)

print(df_pandas.head())
      Main_Author                                          Co_Author  Num_Co_Authors
0  Aaron Anderson  (Elizabeth Peterson, Harry Gregory, Catherine ...              15
1    Aaron Barron  (Martha Neal, James Walton, Amanda Wright, Sus...              11
2      Aaron Bond  (Theresa Lawson, John Riley, Daniel Moore, Mrs...               6
3  Aaron Campbell  (Jim Martin, Nicholas Stanley, Douglas Berry, ...              11
4  Aaron Castillo  (Kevin Young, Patricia Gallegos, Tricia May, M...               6

Compare DataFrames obtained using Pandas and Dask approaches

from pandas.testing import assert_frame_equal
try:
    assert_frame_equal(df_dask[0], df_pandas, check_dtype=True)
except AssertionError as e:
    message = "\n"+str(e)
else:
    message = 'DataFrames created using Dask and Pandas are equivalent.'

Result of comparing two approaches

print(message)
DataFrames created using Dask and Pandas are equivalent.
