Below is my current code for pairwise Jaccard comparison; the description is in the comments. I suspect that converting to NumPy arrays and vectorizing would speed things up, but I am not sure how best to do this. As an aside, most values in the output array will be 0, so the output is effectively a sparse matrix.

import numpy as np
# Values in list1 range from 1 to 25,000,000 (not every value occurs).
# I want to compute the Jaccard similarity for every pair of rows of list1.
list1 = [[123123, 34566, 4634, 3422], [236564, 8543525, 234234],
         [2356574, 3453, 23423, 2342, 234],
         # ... (more rows elided) ...
         [12312, 32523, 345, 345345234]]

# Currently my code looks like this (and is quite slow for large lists):

def jaccard(x, y):
    # Build each set once instead of once per operation.
    set_x, set_y = set(x), set(y)
    intersection_cardinality = len(set_x & set_y)
    union_cardinality = len(set_x | set_y)
    return intersection_cardinality / float(union_cardinality)

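def returnJaccard(cids):
    lenList = len(cids)
    # np.empty would leave the diagonal uninitialized; start from the identity,
    # assuming a non-empty row compared with itself has similarity 1.
    jarr = np.eye(lenList)
    for ix in range(lenList):
        for jx in range(ix):  # jx < ix, so each pair is computed once
            jc = jaccard(cids[ix], cids[jx])
            jarr[ix, jx] = jc
            jarr[jx, ix] = jc
    return jarr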

# Output is an n x n matrix where n = len(list1); all values should be between 0 and 1.
jaccard_compare = returnJaccard(list1)
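
Since most pairs share no values, one possible direction is to skip those pairs entirely instead of vectorizing the full n x n loop. The following is a minimal sketch of that idea using an inverted index (value -> rows containing it), not a tuned solution. The function name `sparse_jaccard` and the dictionary output format are assumptions made for illustration; they are not part of the code above.

```python
from collections import defaultdict
from itertools import combinations


def sparse_jaccard(rows):
    """Return {(i, j): similarity} for i < j, only for pairs sharing a value."""
    sets = [set(r) for r in rows]

    # Inverted index: value -> indices of the rows that contain it.
    index = defaultdict(list)
    for i, s in enumerate(sets):
        for value in s:
            index[value].append(i)

    # Each value shared by rows i and j adds one to their intersection size.
    shared = defaultdict(int)
    for members in index.values():
        for i, j in combinations(members, 2):
            shared[(i, j)] += 1

    # |A union B| = |A| + |B| - |A intersect B|
    return {
        (i, j): n / (len(sets[i]) + len(sets[j]) - n)
        for (i, j), n in shared.items()
    }


result = sparse_jaccard([[1, 2, 3], [2, 3, 4], [5, 6]])
```

Pairs missing from the returned dictionary have similarity 0. This only pays off when each value occurs in few rows; a value present in k rows still contributes k*(k-1)/2 pair updates.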
eyllanesc
jowparks
  • Most of us have been out of school for quite a while ... you will have to give us a better problem description for help (i.e. example input, desired output, actual output) – Joran Beasley May 25 '17 at 23:27
  • `scipy.spatial.distance` has many of these metrics implemented, including Jaccard. See the use of `pdist` here (input doesn't have to be pandas): https://stackoverflow.com/questions/35639571/python-pandas-distance-matrix-using-jaccard-similarity – Niels Joaquin May 26 '17 at 02:19
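
For reference, the `pdist` suggestion in the comment above could look roughly like the sketch below. It assumes toy data (the `rows` list is made up, not the real `list1`) and builds a dense boolean indicator matrix with one column per distinct value, which is unlikely to fit in memory for values ranging up to 25,000,000 across many rows. Note that `pdist` returns Jaccard distance, so similarity is taken as 1 minus the distance.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rows = [[1, 2, 3], [2, 3, 4], [5, 6]]  # toy data, not the real list1

# One boolean column per distinct value; True where the row contains it.
values = sorted({v for r in rows for v in r})
column = {v: k for k, v in enumerate(values)}
indicator = np.zeros((len(rows), len(values)), dtype=bool)
for i, r in enumerate(rows):
    indicator[i, [column[v] for v in r]] = True

# pdist gives Jaccard *distance* in condensed form; squareform expands it
# to n x n, and similarity = 1 - distance.
similarity = 1.0 - squareform(pdist(indicator, metric="jaccard"))
```

Here `similarity` is a dense n x n array with ones on the diagonal, matching the shape of the matrix returned by `returnJaccard`.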
