I have a dask dataframe grouped by the index (first_name).
import pandas as pd
import numpy as np
from multiprocessing import cpu_count
from dask import dataframe as dd
from dask.multiprocessing import get
from dask.distributed import Client
NCORES = cpu_count()
client = Client()
entities = pd.DataFrame({
    'first_name': ['Jake', 'John', 'Danae', 'Beatriz', 'Jacke', 'Jon'],
    'last_name': ['Del Toro', 'Foster', 'Smith', 'Patterson', 'Toro', 'Froster'],
    'ID': ['X', 'U', 'X', 'Y', '12', '13'],
})
df = dd.from_pandas(entities, npartitions=NCORES)
df = client.persist(df.set_index('first_name'))
(Obviously, in real life entities has several thousand rows.)
I want to apply a user-defined function to each grouped dataframe. I want to compare each row with all the other rows in the group (something similar to Pandas compare each row with all rows in data frame and save results in list for each row).
This is the function I am trying to apply:
from fuzzywuzzy import fuzz

def contraster(x, DF):
    # True for every row of DF whose last_name is at least a 50% partial match to x
    matches = DF.apply(lambda row: fuzz.partial_ratio(row['last_name'], x) >= 50, axis=1)
    return [i for i, m in enumerate(matches) if m]
For the test entities data frame, you could apply the function as usual:
entities.apply(lambda row: contraster(row['last_name'], entities), axis=1)
And the expected result is:
Out[35]:
0 [0, 4]
1 [1, 5]
2 [2]
3 [3]
4 [0, 4]
5 [1, 5]
dtype: object
When entities is huge, the solution is to use dask. Note that DF in the contraster function must be the grouped dataframe.
I am trying to use the following:
df.groupby('first_name').apply(func=contraster, args=????)
But how should I specify the grouped dataframe (i.e. DF in contraster)?
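To make the intent concrete, here is a pure-pandas sketch of the behaviour I am after. It relies on the fact that GroupBy.apply passes each group DataFrame to the callable, so the group itself can play the role of DF. The contraster_group wrapper is a hypothetical helper name, and similarity is a difflib stand-in for fuzz.partial_ratio so the snippet is self-contained; this is an illustration of the idea, not a confirmed dask answer.

```python
import pandas as pd
from difflib import SequenceMatcher

def similarity(a, b):
    # difflib stand-in for fuzz.partial_ratio: a rough score in [0, 100]
    return int(100 * SequenceMatcher(None, a, b).ratio())

def contraster(x, DF):
    # True for every row of DF whose last_name scores at least 50 against x
    matches = DF.apply(lambda row: similarity(row['last_name'], x) >= 50, axis=1)
    return [i for i, m in enumerate(matches) if m]

def contraster_group(group):
    # GroupBy.apply calls this once per group and passes the group
    # DataFrame itself -- that group is exactly the DF contraster needs.
    return group.apply(lambda row: contraster(row['last_name'], group), axis=1)

entities = pd.DataFrame({
    'first_name': ['Jake', 'Jake', 'John'],
    'last_name': ['Del Toro', 'Toro', 'Foster'],
})

result = entities.groupby('first_name').apply(contraster_group)
```

Note that the returned indices are positions within each group, not positions in the full frame. With dask, the same callable would presumably need a meta= argument, e.g. df.groupby('first_name').apply(contraster_group, meta=('matches', 'object')).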