I have a function that finds similarity between columns of two dataframes:
def jac_sim_df(df1, df2, thresh):
    L = []
    for col in df1.columns:
        js_list = []
        genes1 = df1.loc[df1[col] >= 2, :].index  # get DEGs for each column in df1
        for column in df2.columns:
            genes2 = df2.loc[df2[column] >= thresh, :].index  # get genes with values at or above the threshold
            js = jaccard_similarity(genes1, genes2)  # calculate Jaccard similarity for genes1 and genes2
            js_list.append(js)
        L.append(js_list)
    df = pd.DataFrame(L)
    return df
I want to vary the threshold to see how it affects the similarity between the two dataframes.
Is there a way to apply this function to two dataframes, df1 and df2, and a list of thresholds?
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randint(0, 100, size=(100, 14)), columns=range(1, 15))
df2 = pd.DataFrame(np.random.rand(100, 14), columns=range(1, 15))
Threshold values can look like this:
thresh = [x / 1000 for x in range(1, 10)]
The jaccard_similarity function:
def jaccard_similarity(list1, list2):
    s1 = set(list1)
    s2 = set(list2)
    return float(len(s1.intersection(s2)) / len(s1.union(s2)))
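As a quick sanity check of what this function returns, here is a small worked example. The gene IDs below are hypothetical and only there for illustration; the function body is the one given above.

```python
def jaccard_similarity(list1, list2):
    s1 = set(list1)
    s2 = set(list2)
    return float(len(s1.intersection(s2)) / len(s1.union(s2)))


# Hypothetical gene IDs, for illustration only
genes_a = ["g1", "g2", "g3"]
genes_b = ["g2", "g3", "g4"]

# intersection = {g2, g3} (2 items), union = {g1, g2, g3, g4} (4 items)
score = jaccard_similarity(genes_a, genes_b)
print(score)  # 0.5
```

Note that if both inputs are empty, the union is empty as well and the division raises a ZeroDivisionError, so a very high threshold combined with an empty DEG set would need a guard.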
The outcome should be multiple dataframes, one per threshold value.
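One possible way to get one dataframe per threshold is a dict comprehension keyed by the threshold value. This is a minimal sketch rather than a definitive solution: the names `thresholds`, `results`, and `combined` are my own, and the fixed random seed is an assumption added only to make the example reproducible.

```python
import numpy as np
import pandas as pd


def jaccard_similarity(list1, list2):
    s1 = set(list1)
    s2 = set(list2)
    return float(len(s1.intersection(s2)) / len(s1.union(s2)))


def jac_sim_df(df1, df2, thresh):
    L = []
    for col in df1.columns:
        js_list = []
        genes1 = df1.loc[df1[col] >= 2, :].index  # DEGs for this df1 column
        for column in df2.columns:
            genes2 = df2.loc[df2[column] >= thresh, :].index  # genes at or above the threshold
            js_list.append(jaccard_similarity(genes1, genes2))
        L.append(js_list)
    return pd.DataFrame(L)


np.random.seed(0)  # assumption: fixed seed, only for reproducibility
df1 = pd.DataFrame(np.random.randint(0, 100, size=(100, 14)), columns=range(1, 15))
df2 = pd.DataFrame(np.random.rand(100, 14), columns=range(1, 15))

thresholds = [x / 1000 for x in range(1, 10)]

# One result dataframe per threshold, keyed by the threshold value
results = {t: jac_sim_df(df1, df2, t) for t in thresholds}

# Optional: stack all results into a single dataframe;
# the dict keys (thresholds) become the outer index level
combined = pd.concat(results)
```

Each entry can then be looked up by its threshold, for example `results[thresholds[0]]`. A dict keeps the link between threshold and result explicit, while the concatenated frame may be more convenient for plotting similarity against the threshold.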