I have two sentences:

sent1 = "This work has been completed by Christopher Pan"
sent2 = "This job has been finished by Mark Spencer"

I calculated the similarity of the sentences using Word2Vec:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def avg_sentence_vector(words, model, num_features, index2word_set):
    # Average the vectors of all words that appear in the model's vocabulary
    featureVec = np.zeros((num_features,), dtype="float32")
    nwords = 0

    for word in words:
        if word in index2word_set:
            nwords = nwords + 1
            featureVec = np.add(featureVec, model[word])

    if nwords > 0:
        featureVec = np.divide(featureVec, nwords)
    return featureVec

as follows:

index2word_set = set(word2vec_model.wv.index2word)  # the model's vocabulary

sent1_avg_vector = avg_sentence_vector(sent1.split(), model=word2vec_model, num_features=100, index2word_set=index2word_set)
sent2_avg_vector = avg_sentence_vector(sent2.split(), model=word2vec_model, num_features=100, index2word_set=index2word_set)

# cosine_similarity expects 2-D inputs, so reshape the averaged vectors
sen1_sen2_similarity = cosine_similarity(sent1_avg_vector.reshape(1, -1), sent2_avg_vector.reshape(1, -1))

I would like to know how I can build a semantic tree which can tell me that:

  • completed and finished are similar words;
  • work and job are similar words too;
  • and that, if I find work/job or completed/finished in a sentence, these words are both connected with Christopher and Mark.

I do not know whether there is something in Python that would allow me to get such results. I would appreciate it if you could point me in the right direction.

Thanks

2 Answers

Using the average of all the word-vectors of words in a text is a quick & simple technique for creating a summary vector of the full text – but won't capture all the shades of meaning, especially those created by grammatical constructions, word-modifiers, or multiword-phrases.

Your word2vec model is likely to already reflect the fact that 'completed' and 'finished' are similar in meaning, or 'work' and 'job', by making their vectors similar. Simply comparing those word-vectors directly, and contrasting the result against comparisons with other word-vectors, will tell you relatively more- or less- similar word pairs or groups.
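
For example, you could compare those word-vectors directly with gensim. A minimal sketch, assuming word2vec_model is the trained gensim Word2Vec model from the question and that these lower-cased words are in its vocabulary:

    # Word-level similarities straight from the word2vec model
    print(word2vec_model.wv.similarity('completed', 'finished'))
    print(word2vec_model.wv.similarity('work', 'job'))

    # Contrast with a less-related pair to judge what counts as "more similar"
    print(word2vec_model.wv.similarity('work', 'finished'))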

It's not clear what you mean by "these words are both connected with Christopher or Mark". A generic set of word-vectors might not have very meaningful vectors for 'Christopher' or 'Mark', as those are proper names with only local meaning, to denote a particular person, without strong associations with larger concepts. (As many word2vec training sets case-flatten words before training, it's possible there won't even be any vector for 'Christopher', capitalized, at all.)

You'd need to say a lot more about what you mean to achieve to know what to recommend. For example, you might need a tool for 'named-entity recognition' ('NER') to identify that 'Christopher Pan' and 'Mark Spencer' are discrete entities of interest, and other grammar-aware parsing or part-of-speech tagging to label them as entities related to some other verb/action.
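
As one illustration (not the only possible tool), the spaCy library provides both NER and dependency parsing. A rough sketch, assuming the small English model has been installed via "python -m spacy download en_core_web_sm":

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("This work has been completed by Christopher Pan.")

    # Named-entity recognition: 'Christopher Pan' should surface as a PERSON
    for ent in doc.ents:
        print(ent.text, ent.label_)

    # Dependency parse: links each token to its syntactic head, which can be
    # used to relate the recognized entity back to the verb 'completed'
    for token in doc:
        print(token.text, token.dep_, token.head.text)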

gojomo

What you could do is create a similarity matrix between the words:

            word1   word2   word3   word4
    word1   0       1.1     2.2     3.3
    word2   1.1     0       2.2     3.3
    word3   1.1     2.2     0       3.3 
    word4   1.1     2.2     3.3     0

(numbers are for demonstration only)

This can be done with two nested for loops, building one row per word:

    matrix = []
    for w1 in words:
        row = []
        for w2 in words:
            row.append(similarity(w1, w2))
        matrix.append(row)

Then iterate over each row of the matrix to retrieve the index of the maximum value, using

    matrix[0].index(max(matrix[0]))

For instance, word1 would return the max index 3 for the value 3.3; therefore, word1 is most similar to word4.
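
Putting it together, a minimal sketch assuming a gensim Word2Vec model named word2vec_model (as in the question) whose vocabulary contains these words, using the model's own cosine similarity and skipping the diagonal, since every word is most similar to itself:

    words = ['work', 'job', 'completed', 'finished']

    # Build the full similarity matrix row by row
    matrix = []
    for w1 in words:
        matrix.append([word2vec_model.wv.similarity(w1, w2) for w2 in words])

    # For each word, find the most similar *other* word (ignore the diagonal)
    for i, row in enumerate(matrix):
        best_j = max((j for j in range(len(row)) if j != i), key=lambda j: row[j])
        print(words[i], 'is most similar to', words[best_j])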