
I would like to group the keys of a dictionary based on their similarity: compare the different keys, and if they are similar enough, group them, probably using some sort of similarity score. I am specifically not interested in how the values within those dictionaries match up (in the example below I kept them the same). I have been looking at similarity scores using sklearn's cosine_similarity, but I could not find a way to apply it to the keys of a dictionary. Does anyone have any clues on this?

I made a test dictionary to show what I mean. Some keys are very similar, and I would like to group those. How to group them is beside the point for now, but let's say I would like to add the numbers up.

As always, many thanks!

from sklearn.metrics.pairwise import cosine_similarity

dictionary = {'United States': {'population': 350, 'Continent': 'North America'},
              'united states': {'population': 350, 'Continent': 'North America'},
              'the United States of America': {'population': 350, 'Continent': 'North America'},
              'USA': {'population': 350, 'Continent': 'North America'},
              'Netherlands': {'population': 17, 'Continent': 'Europe'},
              'the Netherlands': {'population': 17, 'Continent': 'Europe'},
              'Japan': {'population': 160, 'Continent': 'Japan'}
              }
          
– CrossLord

1 Answer


You can't calculate cosine similarity directly between strings. You can either compute a pairwise string distance and cluster on that, or use tf-idf on character n-grams; see this post for a similar discussion. In your case, we can try this:

import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Character bigrams (with word-boundary padding) let near-duplicate spellings
# such as 'United States' and 'the United States of America' share features.
tfidf_vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 2))
df = pd.DataFrame(dictionary).transpose()

# Vectorize the dictionary keys (now the DataFrame index) and cluster on
# the pairwise cosine similarity matrix.
mat = tfidf_vectorizer.fit_transform(df.index)
cl = AgglomerativeClustering(4).fit(cosine_similarity(mat))
df['label'] = cl.labels_

                             population      Continent  label
United States                       350  North America      0
united states                       350  North America      0
the United States of America        350  North America      0
USA                                 350  North America      3
Netherlands                          17         Europe      2
the Netherlands                      17         Europe      2
Japan                               160          Japan      1

I think it's not so easy to group 'USA' together with the other United States variants, since the abbreviation shares almost no character n-grams with the spelled-out names.
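A minimal sketch for the "adding the numbers up" part of the question, assuming the df with the label column produced above:

# Sum the populations per cluster and keep the first continent seen.
grouped = df.groupby('label').agg({'population': 'sum', 'Continent': 'first'})
print(grouped)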

– StupidWolf
  • I was just starting to look in this direction. This is a great example of how it can be done, so thanks! Btw, I don't really like that you have to define the number of groups the data will be split into. I created this example as a simplification, but I'm actually doing this for a string dataset of >2000 strings. – CrossLord Jun 15 '21 at 11:29
  • I have also read a lot about the FuzzyWuzzy Python library. Does anyone have any clues on how @StupidWolf's example compares to it, pros and cons etc.? – CrossLord Jun 15 '21 at 11:29
  • @CrossLord, you can also cut the tree at a certain height, see https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.cut_tree.html. And it all depends on how similar your strings are; for fuzzywuzzy or any distance-based approach, you have to account for the pairwise calculations (a sketch combining both ideas follows below). – StupidWolf Jun 15 '21 at 16:40
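
Following up on the two comments above, a minimal sketch that combines both ideas: fuzzywuzzy's ratio as the pairwise string distance, and scipy's fcluster to cut the hierarchical tree at a height (the same effect as the linked cut_tree), so the number of clusters does not have to be fixed in advance. The threshold t=40 is an assumption to tune on your own data:

import numpy as np
from fuzzywuzzy import fuzz
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

names = list(dictionary)

# Turn fuzzywuzzy's 0-100 similarity score into a distance (0 = identical).
dist = np.array([[100 - fuzz.ratio(a, b) for b in names] for a in names],
                dtype=float)
np.fill_diagonal(dist, 0)  # squareform needs an exactly-zero diagonal

# Cluster on the condensed distance matrix, then cut the tree at a height
# instead of choosing the number of clusters up front.
Z = linkage(squareform(dist), method='average')
labels = fcluster(Z, t=40, criterion='distance')  # t=40 is a guess; tune it
print(dict(zip(names, labels)))

Note that for a dataset of >2000 strings this still computes all ~n²/2 pairwise ratios, which is the caveat about pairwise calculations mentioned above.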