Here is my current code for a pairwise Jaccard comparison, with the description included in the comments. I suspect that converting to a NumPy array and vectorizing would speed things up, but I am not sure how best to do this. As an aside, many of the values in the output array will be 0, so the output is a sparse matrix.
import numpy as np
#values of list1 can be anywhere between 1-25,000,000 (not all values are included)
#I want to perform a jaccard comparison pairwise for each row of list1
list1=[[123123,34566,4634,3422],[236564,8543525,234234],
[2356574,3453,23423,2342,234]...[12312,32523,345,345345234]]
#currently my code looks like this (and is quite slow for large list sizes):
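def jaccard(x, y):
    # Jaccard index: size of the intersection over size of the union (duplicates ignored)
    sx, sy = set(x), set(y)
    union_cardinality = len(sx | sy)
    if union_cardinality == 0:
        return 0.0  # both rows empty: treated as 0 here to avoid dividing by zero
    return len(sx & sy) / float(union_cardinality)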
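def returnJaccard(cids):
    lenList = len(cids)
    # np.empty would leave the diagonal (never written below) as garbage;
    # start from the identity instead, since a row compared with itself gives 1
    jarr = np.eye(lenList)
    for ix in range(lenList):
        for jx in range(ix):  # lower triangle only, i.e. ix > jx
            jc = jaccard(cids[ix], cids[jx])
            # the matrix is symmetric, so fill both halves
            jarr[ix, jx] = jc
            jarr[jx, ix] = jc
    return jarr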
#output is an n x n matrix where n = len(list1), all values should be between 0 and 1
jaccard_compare = returnJaccard(list1)
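
Since most pairs share no values, one direction that might help is to skip those pairs entirely instead of computing a 0 for each of them. The following is only a minimal sketch of that idea, not a vectorized NumPy solution: it builds an inverted index (value -> rows containing it) and counts intersections only for rows that actually share a value. The name `sparse_jaccard` and the dict-of-pairs return format are assumptions made for illustration and are not part of the code above.

```python
from collections import defaultdict
from itertools import combinations


def sparse_jaccard(rows):
    """Return {(i, j): jaccard} for i > j, only for pairs sharing at least one value."""
    sets = [set(r) for r in rows]

    # Inverted index: value -> indices of the rows that contain it.
    index = defaultdict(list)
    for i, s in enumerate(sets):
        for v in s:
            index[v].append(i)

    # Count shared values per pair; pairs with nothing in common never appear.
    shared = defaultdict(int)
    for members in index.values():
        # members is in ascending row order, so j < i for every pair
        for j, i in combinations(members, 2):
            shared[(i, j)] += 1

    # Union size follows from inclusion-exclusion: |A| + |B| - |A and B|
    return {
        (i, j): n / float(len(sets[i]) + len(sets[j]) - n)
        for (i, j), n in shared.items()
    }
```

Every pair missing from the returned dict has a Jaccard value of 0, so the result can be copied into `jarr` or kept as a sparse structure. The cost depends on how many rows share each value: a value that occurs in very many rows generates a quadratic number of pairs for that value, so this is likely to help only when the overlap is as sparse as described.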