Suppose I have the sentence "Jane is running"

and a list of other sentences:

["Jane is a girl", "Jane can run", "Run a race", "Sitting down on sofa", "Sitting down on a chair", "Sitting on a bench", "Climbing a tree", "Climbing a rock", "Run to reach somewhere"]

My goal is to find, given the first sentence, which sentences in the list it matches.

The output needs to be something like: "Jane is running" : "Jane can run", "Jane is a girl", "Run a race", "Run to reach somewhere"

Please note the order of the output: "Jane can run" comes first because it has two matches ("Jane" and "run"), while each of the others matches only one of "Jane" or "run".

The words in the main sentence could just as well have been ran, run, running, Jan, Janet, or June, i.e. spelling errors and variations of words need to be considered.
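To make the variation requirement concrete, here is a small sketch that measures how far apart those variants are using Levenshtein edit distance. The function name `levenshtein` and the word pairs are only my illustration, and I lowercase the words by hand here as an assumption about how the comparison would be done.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


# Word pairs taken from the variations listed above (lowercased by hand).
pairs = [("running", "run"), ("running", "ran"),
         ("jane", "jan"), ("jane", "janet"), ("jane", "june")]
for main_word, variant in pairs:
    print(main_word, variant, levenshtein(main_word, variant))
```

The distances come out as 4, 5, 1, 1 and 1. Since two words can never be further apart than the length of the longer one, a fixed cutoff large enough to accept running/run would also accept most pairs of unrelated short words.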

The algorithm that I came up with is -

  1. Divide the main sentence into a list of words - ["Jane", "is", "running"].
  2. Do the same for each sentence in the list.
  3. For every word in the main sentence, check for a match against every word of every sentence in the list, allowing an edit distance of up to 5 or 6.
  4. Group the sentences that match and sort them by number of matches, highest first.
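The steps above can be sketched roughly as follows. This is not exactly what I described: instead of a fixed edit distance I use the similarity ratio from the standard library's `difflib.SequenceMatcher` as a stand-in, and I skip a few filler words so that "is" does not count as a match. The names `STOP_WORDS`, `tokenize`, `is_fuzzy_match` and `rank_matches`, and the 0.6 threshold, are my own assumptions for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical, deliberately tiny stop-word list (an assumption for this example).
STOP_WORDS = {"is", "a", "on", "to"}


def tokenize(sentence):
    """Steps 1 and 2: lowercase, split on whitespace, drop stop words."""
    return [word for word in sentence.lower().split() if word not in STOP_WORDS]


def is_fuzzy_match(word, candidate, threshold=0.6):
    """Step 3: treat two words as matching if their similarity ratio is high enough."""
    return SequenceMatcher(None, word, candidate).ratio() >= threshold


def rank_matches(main_sentence, sentences, threshold=0.6):
    """Step 4: keep sentences with at least one match, most matches first."""
    main_words = tokenize(main_sentence)
    scored = []
    for sentence in sentences:
        words = tokenize(sentence)
        # Count how many main-sentence words have a fuzzy match in this sentence.
        score = sum(
            any(is_fuzzy_match(main_word, word, threshold) for word in words)
            for main_word in main_words
        )
        if score > 0:
            scored.append((score, sentence))
    # The sort is stable, so ties keep their original list order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in scored]


sentences = ["Jane is a girl", "Jane can run", "Run a race",
             "Sitting down on sofa", "Sitting down on a chair",
             "Sitting on a bench", "Climbing a tree", "Climbing a rock",
             "Run to reach somewhere"]
result = rank_matches("Jane is running", sentences)
print(result)
```

It still compares every word of the main sentence with every word of every sentence in the list, which is the part that feels brute-force to me.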

This feels like a very brute-force approach to the problem. How can I improve it?

daddyodevil
  • This doesn't directly answer your question, but read up on Lucene and ElasticSearch; the way they tokenise and index searches may give you some clues. – Rich Feb 21 '18 at 13:59
  • This may be of some help: https://stackoverflow.com/questions/17388213/find-the-similarity-percent-between-two-strings – Georgy Feb 21 '18 at 14:02
  • Also read up on natural language processing, especially the part on stemming and lemmatization (https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html). There are open-source Python libraries providing those features. – bruno desthuilliers Feb 21 '18 at 14:03
  • You may also want to have a look at Whoosh (https://bitbucket.org/mchaput/whoosh/wiki/Home), which is a pure-Python full-text search engine. – bruno desthuilliers Feb 21 '18 at 14:05
