0

I currently have a text file with around a million sentences, each on a new line. I am trying to build a solution where I can take a new sentence outside of this text file and have the program return the most similar sentence present in the file.

I have found some solutions which return the pair of sentences with the highest similarity INSIDE the existing dataset.For example this one. But that is not what I am going for. I want to be able to compare a new sentence with all of those in the text file.

Also, I am not sure if I should be focusing on semantic similarity or cosine similarity.

mtedu
  • 41
  • 7

1 Answers1

1

I advise you to read about Damerau–Levenshtein distance. I was also looking for a similar solution and settled on this algorithm.

There are implementations for Python:

h1w
  • 78
  • 5
  • Thanks for you answer this will is the direction I am looking for! – mtedu Sep 23 '21 at 10:07
  • 1
    In very large datasets, it is not always feasible to compute the Levenshtein distance of every possible pair of sentences. There are [several other algorithms](https://stackoverflow.com/a/4640306/975097) that may be better for this purpose. – Anderson Green Apr 09 '22 at 20:26