4

I have a simple script currently written with pandas that I want to convert to dask dataframes.
In this script, I am executing a merge on two dataframes on user-specified columns and I am trying to convert it into dask.

def merge_dfs(df1, df2, columns):
    merged = pd.merge(df1, df2, on=columns, how='inner')
...

How can I change this line to match to dask dataframes?

SultanOrazbayev
  • 14,900
  • 3
  • 16
  • 46
Eliran Turgeman
  • 1,526
  • 2
  • 16
  • 34

1 Answers1

5

The dask merge follows pandas syntax, so it's just substituting call to pandas with a call to dask.dataframe:

import dask.dataframe as dd

def merge_dfs(df1, df2, columns):
    merged = dd.merge(df1, df2, on=columns, how='inner')
# ...

The resulting dataframe, merged, will be a dask.dataframe and hence may need computation downstream. This will be done automatically if you are persisting the data to a file, e.g. with .to_csv or with .to_parquet.

If you will need the dataframe for some computation and if the data fits into memory, then calling .compute will create a pandas dataframe:

pandas_df = merged.compute()
SultanOrazbayev
  • 14,900
  • 3
  • 16
  • 46