
I have two lists of words, say,

list 1: future proof
list 2: house past foo bar

I would like to calculate the semantic distance between each word in list 1 and each word in list 2. Fasttext has a nice function to display the nearest neighbours, but it would be nice if there were a way to read out the semantic distance between two given words. Can anyone help, please?

Thanks

alvas
laloca

1 Answer


Unfortunately, NLTK has no function that computes word similarity directly, although it does support synset similarities through its WordNet API.

Though not exhaustive, here's a list of pre-trained word embeddings whose word vectors can be compared with cosine similarity: https://github.com/alvations/vegetables

For example, to use the HLBL embeddings (from Turian et al. 2011), download them from https://www.kaggle.com/alvations/vegetables-hlbl-embeddings (scroll down to the data explorer and download the directory directly; the top download button on the dataset page seems to lead to some corrupted data).

After downloading, you can load the embeddings using numpy:

>>> import numpy as np
>>>
>>> # Row i of the matrix is the vector for the token on line i of the .txt file.
>>> embeddings = np.load('hlbl.rcv1.original.50d.npy')
>>> with open('hlbl.rcv1.original.50d.txt') as fin:
...     tokens = [line.strip() for line in fin]
...
>>> embeddings[tokens.index('hello')]
array([-0.21167406, -0.04189226,  0.22745571, -0.09330438,  0.13239339,
        0.25136262, -0.01908735, -0.02557277,  0.0029353 , -0.06194451,
       -0.22384156,  0.04584747,  0.03227248, -0.13708033,  0.17901117,
       -0.01664691,  0.09400477,  0.06688628, -0.09019949, -0.06918809,
        0.08437972, -0.01485273, -0.12062263,  0.05024147, -0.00416972,
        0.04466985, -0.05316647,  0.00998635, -0.03696947,  0.10502578,
       -0.00190554,  0.03435732, -0.05715087, -0.06777468, -0.11803425,
        0.17845355,  0.18688948, -0.07509124, -0.16089943,  0.0396672 ,
       -0.05162677, -0.12486628, -0.03870481,  0.0928738 ,  0.06197058,
       -0.14603543,  0.04026282,  0.14052328,  0.1085517 , -0.15121481])
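
Since `tokens.index(...)` scans the list on every call and raises `ValueError` for a word that is not in the vocabulary, a dictionary from token to row number may be more convenient for repeated lookups. A minimal sketch, using a tiny made-up vocabulary and matrix in place of the downloaded files; `build_index` and `lookup` are hypothetical helper names, not part of any library:

```python
import numpy as np

def build_index(tokens):
    """Map each token to the first row that holds its vector."""
    index = {}
    for row, token in enumerate(tokens):
        index.setdefault(token, row)  # keep the first row, like list.index
    return index

def lookup(word, index, embeddings):
    """Return the vector for `word`, or None if it is out of vocabulary."""
    row = index.get(word)
    return None if row is None else embeddings[row]

# Toy stand-ins for the real token list and embedding matrix (assumed data).
toy_tokens = ['hello', 'world', 'house']
toy_embeddings = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])

toy_index = build_index(toy_tokens)
hello_vec = lookup('hello', toy_index, toy_embeddings)
missing_vec = lookup('unicorn', toy_index, toy_embeddings)
```

With the real files, `toy_tokens` and `toy_embeddings` would be replaced by the `tokens` list and `embeddings` array loaded earlier.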

To compute the similarity of two numpy arrays, you can use cosine similarity (see "Cosine Similarity between 2 Number Lists"):

import numpy as np

def cos_similarity(a, b):
    # Dot product divided by the product of the two vector norms.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

x, y = np.array([1, 2, 3]), np.array([2, 2, 1])
cos_similarity(x, y)  # about 0.80
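
To get a score for each word of list 1 against each word of list 2, as asked in the question, the same formula can be applied to every pair. A minimal sketch, assuming the vectors are available as a dict from word to numpy array (here a made-up toy dict, not real embeddings); `pairwise_similarity` is a hypothetical helper, and words missing from the dict are simply skipped:

```python
import numpy as np

def cos_similarity(a, b):
    # Dot product divided by the product of the two vector norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pairwise_similarity(words1, words2, vectors):
    """Return {(w1, w2): cosine similarity} for all pairs found in `vectors`."""
    scores = {}
    for w1 in words1:
        for w2 in words2:
            if w1 in vectors and w2 in vectors:
                scores[(w1, w2)] = cos_similarity(vectors[w1], vectors[w2])
    return scores

# Toy 2-d vectors for illustration only (assumed values, not real embeddings).
toy_vectors = {
    'future': np.array([1.0, 0.0]),
    'proof': np.array([0.0, 1.0]),
    'house': np.array([1.0, 1.0]),
    'past': np.array([2.0, 0.0]),
}
list1 = ['future', 'proof']
list2 = ['house', 'past', 'foo', 'bar']  # 'foo' and 'bar' have no vector here

scores = pairwise_similarity(list1, list2, toy_vectors)
for (w1, w2), score in sorted(scores.items()):
    print(w1, w2, round(score, 3))
```

If a distance rather than a similarity is needed, one common choice is 1 minus the cosine similarity.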
alvas