
I have a very large dataframe that I'm handling with dask. The dataframe looks by and large like this:

Col_1    Col_2   Bool_1   Bool_2
A        1       True     False
B        1       True     True
C        1       False    False
D        1       True     False
A        2       False    True
B        2       False    False
C        2       True     False
D        2       True     True

But it has millions of rows.

What I'm trying to do at this point in the code is to calculate a Jaccard distance between Bool_1 and Bool_2 for each group formed by Col_2. This is because the aim of the program is to produce one line for each group present in Col_2 (each line carries several statistics; I'm reporting only the relevant columns here).
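For example, in the sample above the group where Col_2 == 1 has Bool_1 = [True, True, False, True] and Bool_2 = [False, True, False, False]: the intersection counts 1 element and the union counts 3, giving an intersection-over-union value of 1/3.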

To do so, I first group the dataframe by Col_2 using df.groupby("Col_2"), but then I don't know how to proceed: everything I have tried so far has thrown an error.

1: I tried defining a function compute_jacc_dist() and passing it to the groups via apply(compute_jacc_dist, axis=1), but it has issues with the args and kwargs (the axis keyword especially, see https://github.com/dask/dask/issues/1572), which I haven't managed to solve yet.

2: I tried from dask_distance import jaccard and used it to compute the Jaccard distance between Bool_1 and Bool_2, but it produces weird results (every group returns J = 1 even when there is NO intersection).

3: I tried to compute() the dataframe and iterate over the groups using:

import dask_distance

for name, group in df.groupby("Col_2"):
    jacc = dask_distance.jaccard(group["Bool_1"], group["Bool_2"])

But this one is slow as hell, because it triggers a computation and then operates over the huge dataframe group by group (i.e. I don't want to use it). For reference, a script using this approach has been running for two days, while I estimate that either of solutions #1 and #2, if properly set up, would return results in 1-2 hours.

Any suggestion on how I could handle this issue? My ideal solution would be to use df.groupby("Col_2").apply(compute_jacc_dist) in a proper way. Any help much appreciated!

schmat_90

1 Answer


After many hours of trying, here's how I did it. If you're reading this, you may also want to read these questions: "How to apply euclidean distance function to a groupby object in pandas dataframe?" and "Apply multiple functions to multiple groupby columns".

import numpy as np
import pandas as pd

def my_function(x):
    # x is the pandas sub-dataframe for one Col_2 group
    d = {}
    v1 = np.array(x["Bool_1"])
    v2 = np.array(x["Bool_2"])
    intersection = np.logical_and(v1, v2).sum()
    union = np.logical_or(v1, v2).sum()
    # Jaccard = intersection over union; the "if union" guard avoids
    # a ZeroDivisionError when both columns are all False in a group
    d["Jaccard"] = float(intersection) / float(union) if union else 0.0
    return pd.Series(d, index=["Jaccard"])

df = df.groupby("Col_2").apply(my_function, meta={"Jaccard": "float16"}).compute()
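For reference, here is a minimal self-contained check of this approach on the sample data from the question (npartitions=2 is an arbitrary choice for this toy example):

import dask.dataframe as dd
import pandas as pd

# rebuild the question's sample dataframe
pdf = pd.DataFrame({
    "Col_1": list("ABCD") * 2,
    "Col_2": [1, 1, 1, 1, 2, 2, 2, 2],
    "Bool_1": [True, True, False, True, False, False, True, True],
    "Bool_2": [False, True, False, False, True, False, False, True],
})
ddf = dd.from_pandas(pdf, npartitions=2)

print(ddf.groupby("Col_2").apply(my_function, meta={"Jaccard": "float16"}).compute())
# here both groups have intersection 1 and union 3, i.e. Jaccard = 1/3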

Explanation

I create a function that computes the Jaccard distance between the two columns of my dataframe. Within the function, I create a dictionary (d) which will contain the results of my computations.

A perk of having a dictionary is that I can add as many computations as I want, although here there is only one.
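For example, here is a sketch of the same function extended with a second statistic (the union size; the column name Union_size is just an illustrative choice of mine), together with the matching meta:

def my_function(x):
    d = {}
    v1 = np.array(x["Bool_1"])
    v2 = np.array(x["Bool_2"])
    intersection = np.logical_and(v1, v2).sum()
    union = np.logical_or(v1, v2).sum()
    d["Jaccard"] = float(intersection) / float(union) if union else 0.0
    d["Union_size"] = int(union)
    return pd.Series(d, index=["Jaccard", "Union_size"])

# one meta entry per output column (see below)
df.groupby("Col_2").apply(my_function, meta={"Jaccard": "float16", "Union_size": "int64"})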

The function then returns a pd.Series containing the dictionary.

The function is applied to the dataframe groups, which are based on Col_2. The meta data types are specified within apply(), and the whole thing ends with compute(), since it's a dask dataframe and a computation must be triggered to get the result.

The meta passed to apply() should have one entry for each output column the function returns.

schmat_90