
I have this dataset:

import pandas as pd
import itertools

A = ['A','B','C']
M = ['1','2','3']
F = ['plus','minus','square']

df = pd.DataFrame(list(itertools.product(A,M,F)), columns=['A','M','F'])
print(df)

The first rows of the output look like this:

   A  M       F
0  A  1    plus
1  A  1   minus
2  A  1  square
3  A  2    plus
4  A  2   minus
5  A  2  square

I want to do a pairwise comparison (Jaccard similarity) of the rows of this data frame, for example comparing

A 1 plus and A 2 square and getting the similarity value between those two sets.

I have written a Jaccard function:

def jaccard(a, b):
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

This only works on sets, because I used intersection.
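
For example, converting the two rows mentioned above to sets and comparing them:

a = frozenset(['A', '1', 'plus'])    # row 0
b = frozenset(['A', '2', 'square'])  # row 5
print(jaccard(a, b))  # 0.2 -- only 'A' is shared out of 5 distinct elements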

I want the output to look like this (the values shown are just random numbers):

      0     1     2     3     4
0  1.00  0.43  0.61  0.55  0.46
1  0.43  1.00  0.52  0.56  0.49
2  0.61  0.52  1.00  0.48  0.53
3  0.55  0.56  0.48  1.00  0.49
4  0.46  0.49  0.53  0.49  1.00

What is the best way to compute this matrix of pairwise metrics?

Thank you,

user46543

2 Answers


Here is a full implementation of what you want:

# Turn each row into a frozenset, then compare every row against every other row.
series_set = df.apply(frozenset, axis=1)
new_df = series_set.apply(lambda a: series_set.apply(lambda b: jaccard(a, b)))
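
The result is a square DataFrame with the row positions on both axes and 1.0 down the diagonal. For display you could round it (a small usage sketch):

print(new_df.round(2))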
Sebastian Mendez

You could get rid of the nested apply by vectorizing your function. First, get all pair-wise combinations and pass them to a vectorized version of your function -

def jaccard_similarity_score(a, b):
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

i = df.apply(frozenset, axis=1).to_frame()
j = i.assign(foo=1)                            # dummy key for a cross join
k = j.merge(j, on='foo').drop('foo', axis=1)   # every row paired with every row
k.columns = ['A', 'B']

import numpy as np

fnc = np.vectorize(jaccard_similarity_score)
y = fnc(k['A'], k['B']).reshape(len(df), -1)
y
array([[ 1. ,  0.5,  0.5,  0.5,  0.2,  0.2],
       [ 0.5,  1. ,  0.5,  0.2,  0.5,  0.2],
       [ 0.5,  0.5,  1. ,  0.2,  0.2,  0.5],
       [ 0.5,  0.2,  0.2,  1. ,  0.5,  0.5],
       [ 0.2,  0.5,  0.2,  0.5,  1. ,  0.5],
       [ 0.2,  0.2,  0.5,  0.5,  0.5,  1. ]])

This is already faster, but let's see if we can get even faster.


Using senderle's fast cartesian_product -

def cartesian_product(*arrays):
    # Build the n-dimensional grid of all combinations, then flatten it.
    la = len(arrays)
    dtype = np.result_type(*arrays)
    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
    for i, a in enumerate(np.ix_(*arrays)):
        arr[..., i] = a
    return arr.reshape(-1, la)


i = df.apply(frozenset, axis=1).values
j = cartesian_product(i, i)                    # all row pairs as a 2-column array
y = fnc(j[:, 0], j[:, 1]).reshape(-1, len(df))

y

array([[ 1. ,  0.5,  0.5,  0.5,  0.2,  0.2],
       [ 0.5,  1. ,  0.5,  0.2,  0.5,  0.2],
       [ 0.5,  0.5,  1. ,  0.2,  0.2,  0.5],
       [ 0.5,  0.2,  0.2,  1. ,  0.5,  0.5],
       [ 0.2,  0.5,  0.2,  0.5,  1. ,  0.5],
       [ 0.2,  0.2,  0.5,  0.5,  0.5,  1. ]])
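
If you want the matrix back in the tabular form shown in the question, wrapping the array in a DataFrame is enough (a small sketch, reusing the pd import from the question):

result = pd.DataFrame(y)   # row positions on both axes, like the expected output
print(result.round(2))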
cs95
  • Definitely a far better solution, +1. Still, you can't deny the simplicity of the nested `apply`s, and it *should* only be a constant factor slower. Also, I edited my answer to use `frozenset`, completely forgot about that. – Sebastian Mendez Nov 29 '17 at 05:20
  • @Sebastian Admittedly yes, but I'm betting that constant factor is pretty big, and you should see the difference for moderately sized inputs (yes, for large inputs, this ends up becoming slow due to the combinatorial nature of the problem). – cs95 Nov 29 '17 at 05:25
  • Also, I'm not sure what the computationally expensive part of this is, but assuming it's applying `fnc`, you could reduce the time by about a factor of two by only applying it to the upper triangle (see the sketch after this thread). – Sebastian Mendez Nov 29 '17 at 05:29
  • @Sebastian That, and the function itself being inherently slow. What's more, working with columns of frozensets offers 0 benefits in terms of performance, as they are objects. This is the nature of OP's input. Yes, computing the upper triangle only should offer some more speed gain. – cs95 Nov 29 '17 at 05:33
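
A minimal sketch of the upper-triangle idea from the comments above, assuming the jaccard function and df from the question are in scope:

import numpy as np
from itertools import combinations

sets = df.apply(frozenset, axis=1).values
n = len(sets)
out = np.eye(n)                          # the diagonal is 1.0 by definition
for i, j in combinations(range(n), 2):   # visit each unordered pair once
    out[i, j] = out[j, i] = jaccard(sets[i], sets[j])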