
Basically, I have a dict in Python with string keys and arrays of ints as values.

dict = {"Option1Results" : [4, 1, 5, 2, 4],
        "Option2Results" : [11, 44, 2, 1, 5],
        ....
        }

I would like to implement hierarchical clustering on this dict based on the intersection of the values. For example, let's say Option1Results and Option4Results share about 70% of the same integers, then cluster them together. Is there a way to go about this other than looping through the dictionary and comparing the values one by one?

Garrett Hyde
GiveEmMoZo
    What do you mean by "cluster them together"? This is why it is highly recommended that you show your best attempt on SO, no matter how inefficient or crappy. Your code often makes explanations much easier. – Mad Physicist Jul 24 '17 at 20:20
  • dictionaries are not the right option of choice for this problem. – cs95 Jul 24 '17 at 20:21
  • Given `A` is `[1,2,3,4]` and `B` is `[2,3,4,5]` and `C` is `[4,5,6,7]`. By your criteria `A` and `B` would cluster, and `B` and `C` would cluster, but `A` and `C` would not cluster. How would you handle that? The [tag for hierarchical-clustering](https://stackoverflow.com/tags/hierarchical-clustering/info) mentions a number of clustering techniques. Have you picked one? Have you researched algorithms for your clustering technique? – Steven Rumbalski Jul 24 '17 at 20:30
  • You can use a simple `set` intersection to determine the elements your two lists have in common, i.e. `set(your_dict["Option1Results"]) & set(your_dict["Option4Results"])`. Then you can compare its length with the total `set` length to determine the percentage of elements they have in common (e.g. `float(len(set(entries["Option1Results"]) & set(entries["Option2Results"]))) / len(set(entries["Option1Results"]) | set(entries["Option2Results"])) * 100` ) – zwer Jul 24 '17 at 20:53
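The set-intersection percentage from the last comment can be wrapped in a small helper (a sketch; the function name `overlap_pct` is mine, not from the question):

```python
def overlap_pct(a, b):
    """Percentage of distinct integers shared between two lists (Jaccard similarity * 100)."""
    sa, sb = set(a), set(b)
    return 100.0 * len(sa & sb) / len(sa | sb)

overlap_pct([1, 2, 3, 4], [2, 3, 4, 5])  # 3 shared out of 5 distinct -> 60.0
```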

2 Answers


I think you could use two techniques: cosine similarity and k-means.

cosine similarity:

Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them.
https://en.wikipedia.org/wiki/Cosine_similarity

import random
from sklearn.metrics import pairwise

data = {'Option{}Results'.format(i): [random.randint(1, 100) for _ in range(5)] for i in range(100)}
values = list(data.values())  # dict views are not indexable in Python 3
pairwise.cosine_similarity([values[0]], [values[1]])  # expects 2D inputs
array([[ 0.85988428]])

kmeans:

k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. https://en.wikipedia.org/wiki/K-means_clustering

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=5, random_state=0).fit(list(data.values()))
kmeans.predict([data['Option70Results']])  # predict also expects a 2D array
array([2])
galaxyan

To find the integers common to all of the dict's values as a single set:

intersection = set.intersection(*map(set, dict.values()))

(Note that the name `dict`, taken from the question, shadows the built-in type; a different variable name is preferable.)

Hierarchical clustering can be achieved using SciPy's `linkage` and `fcluster`. Hierarchical clustering with SciPy is explained in detail by this answer.
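A minimal sketch of that approach, using set overlap as the similarity measure: build a condensed pairwise Jaccard distance matrix over the dict's values, feed it to `linkage`, and cut the tree with `fcluster`. The sample data beyond the question's two entries and the 0.5 distance threshold are arbitrary choices for illustration:

```python
from itertools import combinations
from scipy.cluster.hierarchy import linkage, fcluster

data = {"Option1Results": [4, 1, 5, 2, 4],
        "Option2Results": [11, 44, 2, 1, 5],
        "Option3Results": [4, 1, 5, 2, 6],   # made-up entry, similar to Option1
        "Option4Results": [11, 44, 2, 3, 5]}  # made-up entry, similar to Option2

keys = list(data)
sets = [set(data[k]) for k in keys]

# Condensed pairwise Jaccard distance: 1 - |A & B| / |A | B|
dist = [1 - len(a & b) / len(a | b) for a, b in combinations(sets, 2)]

# Average-linkage hierarchical clustering, cut at distance 0.5
labels = fcluster(linkage(dist, method='average'), t=0.5, criterion='distance')
print(dict(zip(keys, labels)))
# Option1/Option3 land in one cluster, Option2/Option4 in another
```

Everything pairwise still gets compared once to build `dist`, but the looping and the cluster bookkeeping are handled by SciPy rather than by hand.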

kekkler