I have a very large dataframe that I'm handling with dask. The dataframe looks by and large like this:
Col_1  Col_2  Bool_1  Bool_2
A      1      True    False
B      1      True    True
C      1      False   False
D      1      True    False
A      2      False   True
B      2      False   False
C      2      True    False
D      2      True    True
But it has millions of rows.
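In case it helps to reproduce, a toy version of this frame can be built as below (a sketch only; npartitions is an arbitrary choice here, the real frame is of course loaded elsewhere):

    import pandas as pd
    import dask.dataframe as dd

    # Toy version of the frame shown above; the real one has millions of rows.
    pdf = pd.DataFrame({
        "Col_1": list("ABCDABCD"),
        "Col_2": [1, 1, 1, 1, 2, 2, 2, 2],
        "Bool_1": [True, True, False, True, False, False, True, True],
        "Bool_2": [False, True, False, False, True, False, False, True],
    })
    df = dd.from_pandas(pdf, npartitions=2)  # npartitions is arbitrary here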
What I'm trying to do at this point of the code is to calculate a Jaccard distance between Bool_1 and Bool_2 for each group formed in Col_2. I'm doing this because the aim of the program is to produce one line for each group present in Col_2 (each line has several statistics; I'm reporting only the relevant columns here).
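To be explicit about what I mean, here is the per-group computation sketched with plain pandas, assuming the usual set-based definition of the Jaccard distance (one minus intersection over union of the True entries):

    def compute_jacc_dist(group):
        # `group` is the pandas sub-DataFrame for one value of Col_2
        inter = (group["Bool_1"] & group["Bool_2"]).sum()  # rows where both are True
        union = (group["Bool_1"] | group["Bool_2"]).sum()  # rows where at least one is True
        return 1.0 - inter / union if union else float("nan")  # NaN for all-False groups

For the example above this gives 1 - 1/3 ≈ 0.67 for both groups (intersection 1 and union 3 in each).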
To do so, I first group the dataframe by Col_2 using df.groupby("Col_2"), but then I don't know how to proceed. Every attempt I have tried so far has thrown an error.
1: I tried to define a function compute_jacc_dist() (roughly as sketched above) and pass it to the groups via apply(compute_jacc_dist, axis=1), but it has issues with the args and kwargs (especially the axis, see https://github.com/dask/dask/issues/1572 , which I haven't been able to solve yet). The call shape I've been aiming for is sketched after this list.
2: I tried from dask_distance import jaccard and used it to compute the Jaccard distance between Bool_1 and Bool_2, but it produces weird results (each group returns J=1 even when there is NO intersection).
3: I tried to compute() the dataframe and to iterate over the groups using:

    import dask_distance

    df = df.compute()  # materialise the dask frame back to pandas
    for name, group in df.groupby("Col_2"):
        jacc = dask_distance.jaccard(group["Bool_1"], group["Bool_2"])

But this one is slow as hell, because it triggers a full computation and then operates over such a huge dataframe group by group (i.e. I don't want to use it). For reference, a script using this approach has been running for two days, while I estimate that either of solutions #1 or #2, if set up properly, would return results in 1-2 hours.
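For reference, this is the shape I think attempt #1 should take: as far as I understand, dask's groupby apply hands one pandas DataFrame per group to the function (so no axis argument), and wants an explicit meta describing the output; the name "jacc" in the meta tuple is just a label I made up:

    # Sketch, assuming compute_jacc_dist as defined above and df as the dask frame.
    jacc_per_group = df.groupby("Col_2").apply(compute_jacc_dist, meta=("jacc", "f8"))
    result = jacc_per_group.compute()  # pandas Series: one distance per Col_2 value

I'm not sure this is the proper way to wire it up, though, which is exactly my question.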
Any suggestion on how I could handle this issue? My ideal solution would be to use df.groupby("Col_2").apply(compute_jacc_dist) in a proper way, like the sketch above, though I would also take an apply-free alternative such as the one below.
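For completeness, this is the kind of apply-free route I have in mind: a sketch that turns the intersection and union into plain columns and reduces them with a groupby sum (I'm assuming an aggregation like this stays lazy and parallel, unlike attempt #3, but I'm not sure it's idiomatic dask):

    # Build per-row intersection/union flags, then count them per group.
    tmp = df.assign(
        inter=df["Bool_1"] & df["Bool_2"],   # True where both flags are True
        union=df["Bool_1"] | df["Bool_2"],   # True where at least one flag is True
    )
    counts = tmp.groupby("Col_2")[["inter", "union"]].sum()
    jacc = (1 - counts["inter"] / counts["union"]).compute()  # one distance per group

Any help much appreciated!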