Suppose I have the sentence "Jane is running"
and a list of other sentences -
["Jane is a girl",
"Jane can run",
"Run a race",
"Sitting down on sofa",
"Sitting down on a chair",
"Sitting on a bench",
"Climbing a tree",
"Climbing a rock",
"Run to reach somewhere"]
My goal is to find which sentences in the list match the first sentence.
The output needs to be something like -
"Jane is running" : "Jane can run", "Jane is a girl", "Run a race", "Run to reach somewhere"
Kindly take note of the order of the output: "Jane can run" comes first because it has two matches, "Jane" and "run", while the rest match either "Jane" or "run".
As for the main sentence, the words could just as well have been "ran", "run", "running", "Jan", "Janet" or "June", i.e. spelling errors and variations of words need to be considered.
The algorithm that I came up with is -
- Divide the main sentence into a list of words - ["Jane", "is", "running"].
- Do the same for each sentence in the list.
- For every word in the main sentence, check for matches against every word of every sentence in the list, allowing an edit distance of up to 5 or 6.
- Group the sentences that match and sort them by number of matches, highest first.
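The steps above can be sketched roughly as follows. This is only a minimal illustration, not a tuned implementation: the function names (edit_distance, count_matches, rank_sentences) are made up for this example, matching is assumed to be case-insensitive, and the distance threshold is left as a parameter.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def count_matches(main_words, candidate_words, max_distance):
    """Count main-sentence words that have a close word in the candidate."""
    return sum(
        any(edit_distance(w, c) <= max_distance for c in candidate_words)
        for w in main_words
    )


def rank_sentences(main, sentences, max_distance):
    """Return sentences with at least one match, most matches first."""
    main_words = main.lower().split()  # assumption: case-insensitive matching
    scored = []
    for sentence in sentences:
        score = count_matches(main_words, sentence.lower().split(), max_distance)
        if score > 0:
            scored.append((score, sentence))
    # The sort is stable, so ties keep their original list order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in scored]
```

It would be called as rank_sentences("Jane is running", sentences, max_distance=2). The threshold is a parameter because the edit distance between two words never exceeds the length of the longer one, so a limit of 5 or 6 lets short words such as "is" or "a" match nearly any other short word.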
This feels like a very brute-force approach to the problem. How can I improve it?