Similar to the pandas GroupBy to List post, we are trying to run this process in dask.
Our current solution uses the groupby apply function. Since this is a bottleneck in our process, are there any other options?
Below is sample code using the dask.datasets.timeseries data.

    import dask
    import dask.dataframe as dd
    import pandas as pd

    def set_list_att2(x: dd.Series):
        # Collect the distinct values of one group into a list
        return list(set([item for item in x.values]))

    df = dask.datasets.timeseries()
    df_gb = df.groupby(df.name)
    gp_col = ['x', 'y', 'id']

    # One lazy Series of per-group unique-value lists for each column
    list_ser_gb = [df_gb[att_col_gr].apply(set_list_att2,
                   meta=pd.Series(dtype='object', name=f'{att_col_gr}_att'))
                   for att_col_gr in gp_col]

    # Start from the group sizes, then join each list column onto it
    df_edge_att = df_gb.size().to_frame(name="Weight")
    for ser in list_ser_gb:
        df_edge_att = df_edge_att.join(ser.compute().to_frame(), how='left')
    df_edge_att.head()

Note that in the line

    df_edge_att = df_edge_att.join(ser.compute().to_frame(), how='left')

we added the compute() call; otherwise the sample code returned only one row in the final dataframe.
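
To make the desired output concrete, here is a minimal pandas-only sketch on a small made-up frame (the `toy` data and the `unique_list` helper are illustrative assumptions, not part of our real pipeline). It shows the shape we are after: one row per name, with the group size and a list of distinct values per column. The lists are sorted here only to make the result deterministic; our dask code above does not sort.

```python
import pandas as pd

# Toy data (hypothetical, stands in for the dask timeseries dataset).
toy = pd.DataFrame({
    "name": ["Alice", "Alice", "Bob", "Bob", "Bob"],
    "id": [1, 1, 2, 3, 3],
    "x": [0.1, 0.2, 0.3, 0.3, 0.4],
})

def unique_list(s: pd.Series) -> list:
    """Return the distinct values of one group's column as a sorted list."""
    return sorted(set(s))

grouped = toy.groupby("name")

# Group size plus one list-valued column per attribute, aligned on the name index.
result = pd.concat(
    [
        grouped.size().rename("Weight"),
        grouped["id"].agg(unique_list).rename("id_att"),
        grouped["x"].agg(unique_list).rename("x_att"),
    ],
    axis=1,
)
print(result)
```

What we are looking for is a way to get this same result in dask without the per-group apply.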