I have some SAS code that I am trying to convert to Python. I am having difficulty calculating the Jaccard distance on asymmetric binary data, where positions that are 0 in both rows should be ignored in the calculation. I have found some Jaccard examples, but they do not calculate the asymmetric distance. I am checking whether a library already provides this before I try to reinvent the wheel. If someone could steer me in the right direction, I would really appreciate it.
My test dataset has 5 columns (H0 to H4) and 5 rows (A to E):
    H0  H1  H2  H3  H4
A   1   1   1   1   0
B   1   0   1   1   0
C   1   1   1   1   0
D   0   0   1   1   1
E   1   1   0   1   0
Below is the expected result (distance), calculated by hand and also with SAS:
. | A    | B    | C    | D    | E
A | 0    | 0.25 | 0    | 0.6  | 0.25
B | 0.25 | 0    | 0.25 | 0.5  | 0.5
C | 0    | 0.25 | 0    | 0.6  | 0.25
D | 0.6  | 0.5  | 0.6  | 0    | 0.8
E | 0.25 | 0.5  | 0.25 | 0.8  | 0
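To make the definition concrete: each value above is the fraction of differing positions among the positions where at least one of the two rows has a 1, so 0/0 matches are never counted. Here is a minimal sketch of that calculation for rows A and D. The function name `asymmetric_jaccard_distance` is hypothetical, and returning 0 for two all-zero rows is an assumption, not something the SAS output confirms.

```python
import numpy as np

def asymmetric_jaccard_distance(x, y):
    """Jaccard distance that ignores positions where both inputs are 0."""
    x = np.asarray(x, dtype=bool)
    y = np.asarray(y, dtype=bool)
    either = np.logical_or(x, y).sum()   # positions with at least one 1
    both = np.logical_and(x, y).sum()    # positions where both are 1
    if either == 0:
        return 0.0  # assumption: two all-zero rows are treated as identical
    return float(either - both) / float(either)

row_a = [1, 1, 1, 1, 0]
row_d = [0, 0, 1, 1, 1]
# A and D share two 1s (H2, H3) out of five positions with any 1
print(asymmetric_jaccard_distance(row_a, row_d))  # 0.6
```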
But using Jaccard in Python, I get results like:
. | A    | B    | C    | D    | E
A | 1.00 | 0.43 | 0.61 | 0.55 | 0.46
B | 0.43 | 1.00 | 0.52 | 0.56 | 0.49
C | 0.61 | 0.52 | 1.00 | 0.48 | 0.53
D | 0.55 | 0.56 | 0.48 | 1.00 | 0.49
E | 0.46 | 0.49 | 0.53 | 0.49 | 1.00
Below is the code I experimented with. I am new to Python, so I might be making an obvious mistake. I have added the SAS code at the bottom for reference:
Python Code:
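import itertools

import numpy as np
import pandas as pd
# jaccard_similarity_score was deprecated in scikit-learn 0.21 and later removed,
# so this import needs an older release
from sklearn.metrics import jaccard_similarity_score, pairwise_distances

np.random.seed(0)
# 100 random binary rows in 5 columns named A-E (not the test dataset above)
df = pd.DataFrame(np.random.binomial(1, 0.5, size=(100, 5)),
                  columns=list('ABCDE'))
print(df.head())

# Attempt 1: similarity between columns, computed as 1 - Jaccard distance
jac_sim = 1 - pairwise_distances(df.T, metric="jaccard")
jac_sim = pd.DataFrame(jac_sim, index=df.columns, columns=df.columns)

# Attempt 2: fill a symmetric similarity matrix one column pair at a time
sim_df = pd.DataFrame(np.ones((5, 5)), index=df.columns, columns=df.columns)
for col_pair in itertools.combinations(df.columns, 2):
    sim_df.loc[col_pair] = sim_df.loc[tuple(reversed(col_pair))] = \
        jaccard_similarity_score(df[col_pair[0]], df[col_pair[1]])
print(sim_df)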
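For comparison, here is a minimal sketch that runs the Jaccard metric on the five-row test dataset instead of random data. It compares rows rather than columns and keeps the distance rather than converting it to a similarity. It assumes SciPy is installed, and the names `data`, `labels` and `dist_df` are illustrative. On this dataset it gives the hand-calculated distances; whether SciPy's `jaccard` agrees with SAS `djaccard` and `absent=0` in every case is an assumption that has not been checked here.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

labels = list("ABCDE")
data = np.array([
    [1, 1, 1, 1, 0],  # A
    [1, 0, 1, 1, 0],  # B
    [1, 1, 1, 1, 0],  # C
    [0, 0, 1, 1, 1],  # D
    [1, 1, 0, 1, 0],  # E
], dtype=bool)

# pdist returns the condensed pairwise distances between rows;
# squareform expands them into a full symmetric matrix with a zero diagonal
dist = squareform(pdist(data, metric="jaccard"))
dist_df = pd.DataFrame(dist, index=labels, columns=labels)
print(dist_df)
```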
SAS Code:
proc import datafile = '/home/xxx/xxx.csv'
out = work.Binary2 replace
dbms = CSV;
GUESSINGROWS=MAX;
run;
proc sort;
by VAR1;
run;
title 'Data Clustering of BN';
proc distance data=Binary2 method=djaccard absent=0 out=distjacc;
var anominal (r0--r4);
id VAR1;
run;