
I have this dataset:

import pandas as pd
import itertools

A = ['A','B','C']
M = ['1','2','3']
F = ['plus','minus','square']

df = pd.DataFrame(list(itertools.product(A,M,F)), columns=['A','M','F'])
print(df)

The first rows of the output look like this:

   A  M       F
0  A  1    plus
1  A  1   minus
2  A  1  square
3  A  2    plus
4  A  2   minus
5  A  2  square

I want to do a pairwise comparison (Jaccard similarity) of the rows of this data frame, for example comparing

A 1 plus and A 2 square and getting the similarity value between those two sets.

I have written a Jaccard function:

def jaccard(a, b):
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

This only works on sets, because I used intersection.
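
For example, converting the two rows mentioned above to sets and comparing them:

a = frozenset(['A', '1', 'plus'])    # row 0
b = frozenset(['A', '2', 'square'])  # row 5
print(jaccard(a, b))  # 0.2 -- only 'A' is shared out of 5 distinct elements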

I want the output to look like this (the values shown are just random numbers):

      0     1     2     3     4
0  1.00  0.43  0.61  0.55  0.46
1  0.43  1.00  0.52  0.56  0.49
2  0.61  0.52  1.00  0.48  0.53
3  0.55  0.56  0.48  1.00  0.49
4  0.46  0.49  0.53  0.49  1.00

What is the best way to compute this matrix of pairwise metrics?

Thank you,

user46543

2 Answers


Here is a full implementation of what you want:

# Turn each row into a frozenset, then compare every row against every other row.
series_set = df.apply(frozenset, axis=1)
new_df = series_set.apply(lambda a: series_set.apply(lambda b: jaccard(a, b)))
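
The result is a square DataFrame with the row positions on both axes and 1.0 down the diagonal. For display you could round it (a small usage sketch):

print(new_df.round(2))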
Sebastian Mendez

You could get rid of the nested apply by vectorizing your function. First, get all pair-wise combinations and pass them to a vectorized version of your function -

def jaccard_similarity_score(a, b):
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

i = df.apply(frozenset, axis=1).to_frame()
j = i.assign(foo=1)                            # dummy key for a cross join
k = j.merge(j, on='foo').drop('foo', axis=1)   # every row paired with every row
k.columns = ['A', 'B']

import numpy as np

fnc = np.vectorize(jaccard_similarity_score)
y = fnc(k['A'], k['B']).reshape(len(df), -1)
y
array([[ 1. ,  0.5,  0.5,  0.5,  0.2,  0.2],
       [ 0.5,  1. ,  0.5,  0.2,  0.5,  0.2],
       [ 0.5,  0.5,  1. ,  0.2,  0.2,  0.5],
       [ 0.5,  0.2,  0.2,  1. ,  0.5,  0.5],
       [ 0.2,  0.5,  0.2,  0.5,  1. ,  0.5],
       [ 0.2,  0.2,  0.5,  0.5,  0.5,  1. ]])

This is already faster, but let's see if we can get even faster.


Using senderle's fast cartesian_product -

def cartesian_product(*arrays):
    # Build the n-dimensional grid of all combinations, then flatten it.
    la = len(arrays)
    dtype = np.result_type(*arrays)
    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
    for i, a in enumerate(np.ix_(*arrays)):
        arr[..., i] = a
    return arr.reshape(-1, la)


i = df.apply(frozenset, axis=1).values
j = cartesian_product(i, i)                    # all row pairs as a 2-column array
y = fnc(j[:, 0], j[:, 1]).reshape(-1, len(df))

y

array([[ 1. ,  0.5,  0.5,  0.5,  0.2,  0.2],
       [ 0.5,  1. ,  0.5,  0.2,  0.5,  0.2],
       [ 0.5,  0.5,  1. ,  0.2,  0.2,  0.5],
       [ 0.5,  0.2,  0.2,  1. ,  0.5,  0.5],
       [ 0.2,  0.5,  0.2,  0.5,  1. ,  0.5],
       [ 0.2,  0.2,  0.5,  0.5,  0.5,  1. ]])
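
If you want the matrix back in the tabular form shown in the question, wrapping the array in a DataFrame is enough (a small sketch, reusing the pd import from the question):

result = pd.DataFrame(y)   # row positions on both axes, like the expected output
print(result.round(2))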
cs95
  • Definitely a far better solution, +1. Still, you can't deny the simplicity of the nested `apply`s, and it *should* only be a constant factor slower. Also, I edited my answer to use `frozenset`, completely forgot about that. – Sebastian Mendez Nov 29 '17 at 05:20
  • @Sebastian Admittedly yes, but I'm betting that constant factor is pretty big, and you should see the difference for moderately sized inputs (yes, for large inputs, this ends up becoming slow due to the combinatorial nature of the problem). – cs95 Nov 29 '17 at 05:25
  • Also, I'm not sure what the computationally expensive part of this is, but assuming it's applying `fnc`, you could reduce the time by about a factor of two by only applying it to the upper triangle (see the sketch after this thread). – Sebastian Mendez Nov 29 '17 at 05:29
  • @Sebastian That, and the function itself being inherently slow. What's more, working with columns of frozensets offers 0 benefits in terms of performance, as they are objects. This is the nature of OP's input. Yes, computing the upper triangle only should offer some more speed gain. – cs95 Nov 29 '17 at 05:33
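
A minimal sketch of the upper-triangle idea from the comments above, assuming the jaccard function and df from the question are in scope:

import numpy as np
from itertools import combinations

sets = df.apply(frozenset, axis=1).values
n = len(sets)
out = np.eye(n)                          # the diagonal is 1.0 by definition
for i, j in combinations(range(n), 2):   # visit each unordered pair once
    out[i, j] = out[j, i] = jaccard(sets[i], sets[j])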