Find most similar sentence in a large dataset of sentences

Question

I currently have a text file with around a million sentences, each on a new line. I am trying to build a solution where I can take a new sentence outside of this text file and have the program return the most similar sentence present in the file.

I have found some solutions which return the pair of sentences with the highest similarity INSIDE the existing dataset, for example this one. But that is not what I am going for. I want to be able to compare a new sentence with all of those in the text file.

Also, I am not sure if I should be focusing on semantic similarity or cosine similarity.

This is similar to the problem of finding the [most similar document](https://stats.stackexchange.com/questions/148744/finding-similar-documents-in-a-big-data-set) in a large dataset. — Anderson Green, Apr 09 '22 at 20:30

score 1 · Accepted Answer · answered Sep 21 '21 at 19:03

I advise you to read about Damerau–Levenshtein distance. I was also looking for a similar solution and settled on this algorithm.

There are implementations for Python:
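As a dependency-free illustration of the idea (not one of the packaged implementations), here is a minimal sketch. It uses the optimal string alignment variant, a restricted form of Damerau–Levenshtein distance in which no substring is edited more than once. The names `osa_distance` and `most_similar` are made up for this example, and the loop skips any candidate whose length difference already rules it out, since the distance can never be smaller than the difference in lengths.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    # d[i][j] holds the distance between the prefixes a[:i] and b[:j].
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i  # delete all i characters
    for j in range(len(b) + 1):
        d[0][j] = j  # insert all j characters
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Swap of two adjacent characters counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


def most_similar(query: str, sentences):
    """Return (sentence, distance) for the candidate closest to query."""
    best, best_dist = None, None
    for s in sentences:
        # The distance is at least the length difference, so a candidate
        # that cannot beat the current best is skipped without computing it.
        if best_dist is not None and abs(len(s) - len(query)) >= best_dist:
            continue
        dist = osa_distance(query, s)
        if best_dist is None or dist < best_dist:
            best, best_dist = s, dist
    return best, best_dist


if __name__ == "__main__":
    # In-memory stand-in for the text file; with a real file you could pass
    # (line.rstrip("\n") for line in open(path)) instead.
    corpus = ["a dog barked", "the cat sat on a mat", "hello world"]
    print(most_similar("the cat sat on the mat", corpus))
```

Each comparison costs time proportional to the product of the two sentence lengths, and every query still scans the whole file, so for a million sentences a compiled implementation is likely to be needed. Note also that this measures character-level edits, not meaning: two paraphrases with different wording will score as far apart.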

h1w


Thanks for your answer, this is the direction I am looking for! – mtedu Sep 23 '21 at 10:07

In very large datasets, it is not always feasible to compute the Levenshtein distance of every possible pair of sentences. There are [several other algorithms](https://stackoverflow.com/a/4640306/975097) that may be better for this purpose. – Anderson Green Apr 09 '22 at 20:26
