
I'm new to NLP but I'm trying to match a list of sentences to another list of sentences in Python based on their semantic similarity. For example,

list1 = ['what they ate for lunch', 'height in inches', 'subjectid']
list2 = ['food eaten two days ago', 'height in centimeters', 'id']

Based on previous posts and prior knowledge, it seemed the best way was to create a document vector for each sentence and compute the cosine similarity scores between the lists. Other posts I've found with regard to Doc2Vec, as well as the tutorial, seem focused on prediction. This post does a good job of doing the calculation by hand, but I thought Doc2Vec could already do that. The code I'm using is

import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def build_model(train_docs, test_docs, comp_docs):
    '''
    Parameters
    -----------
    train_docs: list of lists - combination of both sentence lists
    test_docs: list of lists - one of the sentence lists
    comp_docs: list of lists - combined sentence lists, used to match an index back to its sentence
    '''
    # Train model
    model = Doc2Vec(dm = 0, dbow_words = 1, window = 2, alpha = 0.2)#, min_alpha = 0.025)
    model.build_vocab(train_docs)
    for epoch in range(10):
        model.train(train_docs, total_examples = model.corpus_count, epochs = epoch)
        #model.alpha -= 0.002
        #model.min_alpha = model.alpha


    scores = []

    for doc in test_docs:
        dd = {}
        # Calculate the cosine similarity and return top 40 matches
        score = model.docvecs.most_similar([model.infer_vector(doc)],topn=40)
        key = " ".join(doc)
        for i in range(len(score)):
            # Get index and score
            x, y = score[i]
            #print(x)
            # Match sentence from other list
            nkey = ' '.join(comp_docs[x])
            dd[nkey] = y
        scores.append({key: dd})

    return scores

which works to calculate the similarity scores, but the issue here is that I have to train the model on all the sentences from both lists (or one of the lists) and then match. Is there a way to use Doc2Vec to just get the vectors and then compute the cosine similarity? To be clear, I'm trying to find the most similar sentences between the lists. I'd expect output like

scores = []
for s1 in list1:
    for s2 in list2:
        scores.append((s1, s2, similarity(s1, s2)))

print(scores)
[('what they ate for lunch', 'food eaten two days ago', 0.23567),
 ('what they ate for lunch', 'height in centimeters', 0.120),
 ('what they ate for lunch', 'id', 0.01023),
 ('height in inches', 'food eaten two days ago', 0.123),
 ('height in inches', 'height in centimeters', 0.8456),
 ('height in inches', 'id', 0.145),
 ('subjectid', 'food eaten two days ago', 0.156),
 ('subjectid', 'height in centimeters', 0.1345),
 ('subjectid', 'id', 0.9567)]
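
Assuming a trained model is available, I'd picture similarity() being roughly the sketch below (the model file name here is just a placeholder, and the sentences would need to be tokenized first):

from gensim.models.doc2vec import Doc2Vec
from numpy import dot
from numpy.linalg import norm

# Placeholder: some already-trained Doc2Vec model
model = Doc2Vec.load("my_doc2vec.model")

def similarity(s1, s2):
    # Infer a vector for each tokenized sentence, then take the cosine similarity
    v1 = model.infer_vector(s1.split())
    v2 = model.infer_vector(s2.split())
    return dot(v1, v2) / (norm(v1) * norm(v2))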
m13op22
  • Can you clarify what you mean by "most similar sentences between lists"? What would a couple of example "lists" you'd provide as input look like, and what would the desired output be? (Separately: you're using an atypically large `alpha` and then calling `train()` multiple times yourself, leaving the `alpha` atypically large. It's much better to just leave the default `alpha` in place and call `train()` only once, with your desired count of `epochs`, to let the code do the right thing for you.) – gojomo Mar 08 '19 at 20:10
  • Thanks, @gojomo, I added an example. I found that `alpha` worked the best when training my model the first time, but that's good to know so I'll try some other values out. – m13op22 Mar 08 '19 at 22:06
  • So you want to calculate the pairwise similarities for each item in `list1` against `list2`? It looks like you already have working code, so what's the pending need? (Or, do you just not yet have a 'similarity()` function?) – gojomo Mar 08 '19 at 23:30
  • Note still that your code as currently shown is very atypical to the point of nonsensicality in its `alpha` and `train()` management. It's doing 10 `train()`s, but the 1st does 0 passes and the last does 9, and every `train()` decays the effective alpha from `0.2` to `0.0001` - a falling-and-rising sawtooth pattern. That's improper SGD & will lead to early texts always training with high alpha, late texts always with low alpha. You should get a much better model with more default/sane practices. – gojomo Mar 08 '19 at 23:32
  • @gojomo I wanted to see if I was using Doc2vec in a similar way I would calculate pairwise similarities and if not, if it was possible to do that with Doc2Vec. Thanks for the suggestions for better parameters. – m13op22 Mar 11 '19 at 21:30
  • @gojomo related, do you have good documentation for typical Doc2vec parameters? I've been able to pick up a lot from your answers to other posts and changed my model to `model = Doc2Vec(dm=0, dbow_words = 1, window = 2)` and only trained it once. But I used 500 epochs to get the best matches, which seems rather high but could be because my model only has ~3,000 documents. – m13op22 Mar 14 '19 at 18:54
  • The defaults are fairly representative of typical parameters, but most published work does further parameter tweaking. In `Doc2Vec`, an `epochs` of 10-20 is more common than gensim's default (inherited from `Word2Vec`) of only 5. Most work uses 100s-of-thousands to millions of docs - so 3,000 docs is very small. Using more epochs and/or smaller vectors can sometimes squeeze some meaningfulness from smaller corpuses, but needing 500 epochs seems extreme, and perhaps indicative of other problems. (Again, if you're calling `train()` in a loop more than once, you're probably doing things wrong.) – gojomo Mar 14 '19 at 19:29
  • Thanks, I thought I mentioned that I was only calling `train()` once (my bad) with the line `model.train(train_docs, total_examples = model.corpus_count, epochs = 500)`. Yeah, since I have so few docs compared to other works, I thought more epochs would help. I could experiment with smaller vector sizes. – m13op22 Mar 14 '19 at 21:38
  • The code in the question still shows `train()` in a loop, with each `train()` performing more internal `epochs`. Maybe 500 is helpful for a short corpus (and perhaps small individual docs?), and further by giving the word-vectors (via `dbow_words=1`) more training keeps helping - it's just very much beyond what's typical in larger corpuses and published work. – gojomo Mar 15 '19 at 01:14
  • Ok, yes the individual docs are sentences rather than paragraphs (see sample data above). – m13op22 Mar 15 '19 at 13:58
  • Is your example in the question of `list1` meant to be one document, or three? (And are those meant to be literal examples of the document(s) texts, or descriptive hints of what the text might actually be, so you might have single-word documents that are just `['salad']`, `['57in']`, `['id9076']`?) – gojomo Mar 15 '19 at 15:22
  • Three documents meant to be literal examples of the document texts. There are one or two single-word documents, but most are at least two. – m13op22 Mar 15 '19 at 15:57
  • But then, of your 6 example documents, two are already just single words: `['subjectid']` and `['id']`. (Are those the only two 1-word docs in your 3000 docs?) And note that `'what they ate for lunch'` as a string isn't an adequate document for `Doc2Vec` - you need a `words` list that's already tokenized words, like say `['what', 'they', 'ate', 'for', 'lunch']`, **and** one or more `tags`. So when you say "see sample data above", it's not at all clear how what's in the question relates to actual documents presented to `Doc2Vec`. – gojomo Mar 15 '19 at 18:53
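
Following the suggestions in the comments, a more conventional setup (tokenized TaggedDocuments with tags, the default alpha, and a single train() call with an explicit epochs count) might look roughly like this sketch; the tag scheme and epoch count here are only illustrative:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Tokenize each sentence and tag it with its index
raw_docs = list1 + list2
train_docs = [TaggedDocument(words=s.split(), tags=[i]) for i, s in enumerate(raw_docs)]

# Default alpha, one build_vocab() and one train() call with an explicit epoch count
model = Doc2Vec(dm=0, dbow_words=1, window=2, epochs=20)
model.build_vocab(train_docs)
model.train(train_docs, total_examples=model.corpus_count, epochs=model.epochs)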

2 Answers


If your concern is that training the model and getting the result at runtime is a time-consuming task, then consider saving the model. You can train your model in a separate file and save it to your disk.

Right after your training

model.save("similar_sentence.model")

Create a new file and load the model like below,

model = Doc2Vec.load("similar_sentence.model")

The model file will hold the vectors from your trained sentences.

The model object can be saved and loaded anywhere in your code.
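
For example (a rough sketch; the sentence below is just illustrative), once the saved model is loaded you can infer vectors and query similarities without retraining:

model = Doc2Vec.load("similar_sentence.model")

# No retraining needed: infer a vector for a new tokenized sentence
# and compare it against the stored document vectors
vec = model.infer_vector(['height', 'in', 'inches'])
print(model.docvecs.most_similar([vec], topn=5))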

Semantic “Similar Sentences” with your dataset-NLP

Shankar Ganesh Jayaraman

Doc2vec can generate a vector if you provide it with the words you want a vector for, but a doc2vec model would need to exist first. However, this model does not necessarily need to be trained on the sentences you're trying to compare. I don't know if pregenerated doc2vec models exist, but I do know you can import word2vec models that have pretrained vectors. Whether or not you want to do this depends a bit on the types of sentences you're comparing - generally the word2vec models are trained on corpuses like Wikipedia or 20newsgroups, so they might not have vectors (or might have poor vectors) for words that don't occur often in those articles, i.e. if you were trying to compare sentences with a lot of scientific terms you might not want to use a pretrained model. However, you cannot generate a vector without having first trained a model (I think this is your core question).
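
As a rough sketch of the pretrained-word-vectors route (the model name and whitespace tokenization are just examples), gensim's downloader can fetch pretrained vectors, and KeyedVectors.n_similarity compares the mean vectors of two word lists:

import gensim.downloader as api

# Pretrained GloVe word vectors (word-level, not doc2vec)
wv = api.load("glove-wiki-gigaword-100")

def sentence_similarity(s1, s2):
    # Keep only in-vocabulary words, then compare the two word sets
    w1 = [w for w in s1.lower().split() if w in wv]
    w2 = [w for w in s2.lower().split() if w in wv]
    return wv.n_similarity(w1, w2)

print(sentence_similarity('height in inches', 'height in centimeters'))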

Evan Mata
  • Also, gensim has pretrained models I believe. You might need to download them though. – Evan Mata Mar 08 '19 at 17:08
  • I'll look into what models might exist. My sentences are more scientific and probably wouldn't be contained in the typical `nltk` corpuses, which is why I was trying to do it without training a model. Am I on the right track for using Doc2vec to generate vectors of my sentences? – m13op22 Mar 08 '19 at 22:09
  • I personally know word2vec better than doc2vec, but your approach seems fine. I'm not entirely sure what kind of corpus you would want to train on, but you could look into the WOS (web of science) dataset. That said, for corpuses with specific vocabulary you typically don't want to use pregenerated vectors but instead train on your corpus. Basically, you always need to train an embedding model, but you want the model you're training on to have context similar to what you are using. There's not a lot of downside to training on your corpus, unless it's simply too small. – Evan Mata Mar 08 '19 at 22:32