
I need some advice on the following problem.

I'm given a set of weighted keywords (by percentage) and need to find a text in a database that best matches those keywords. I will give an example.

I'm presented with these keywords:

  • Sun (90%)
  • National Park (85%; some keywords contain two words)
  • Landmark (60%)

Now let's say my database contains three text entries, e.g.:

  1. Going-to-the-Sun Road is a scenic mountain road in the Rocky Mountains of the western United States, in Glacier National Park in Montana.
  2. Everybody has a little bit of the sun and moon in them. Everybody has a little bit of man, woman, and animal in them.
  3. A hybrid car is one that uses more than one means of propulsion - that means combining a petrol or diesel engine with an electric motor.

Obviously the first text is the one that best matches the given set of keywords, so this is what I want to recommend to the user. Next comes the second text, which relates somewhat to the "sun" keyword and could be an acceptable choice too.

The third text is totally irrelevant and should only be recommended as a last resort, when everything else fails.

I'm totally new to this kind of thing, so I need some advice on which technologies/algorithms I should use. It seems like some machine learning (NLP) or some kind of fuzzy logic is involved, but I'm not really sure.

ThanosFisherman
  • The second sentence is classic *demagoguery* and metaphorical thinking, which can't realistically arise except through bad intent. That is typical of irrational human beings but not of NLP engines, unless you want to build an NLP system that will deceive, mislead, misinterpret and confuse. So your model has to handle metaphors and demagoguery somehow. – dmitryro Oct 26 '20 at 19:44
  • Perhaps my second example wasn't an ideal one, but I do not need to handle metaphors anyway so it's ok for me. – ThanosFisherman Oct 26 '20 at 21:22
  • If you have weights, very basic proximity can be achieved using *Levenshtein distance* - 1 full match, 0 no match (100% full match, 0 no match). If basic methods such as Levenshtein/Logistic Regression/SVM/KNN are not enough, there are of course more ways using fuzzy sets and a various – dmitryro Oct 26 '20 at 21:35

1 Answer

You need to use a combination of query-term boosting and synonyms.

Look into the question "Is there a way to do fuzzy string matching for words on string?"
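To make the boosting idea concrete, here is a minimal Python sketch, not a definitive implementation. It assumes the keyword percentages from the question can be used directly as boost weights, and that a text's score is simply the sum of the weights of the keywords it contains. The function names (`tokenize`, `contains_phrase`, `score`, `rank`) and the optional `synonyms` mapping are hypothetical names made up for illustration; a real search engine would handle stemming, term frequency and fuzzy matching as well.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_phrase(tokens, phrase_tokens):
    """Return True if phrase_tokens occurs as a contiguous run inside tokens."""
    n = len(phrase_tokens)
    return any(tokens[i:i + n] == phrase_tokens
               for i in range(len(tokens) - n + 1))


def score(text, weighted_keywords, synonyms=None):
    """Sum the weights of every keyword (or one of its synonyms) found in text.

    Assumption: each keyword counts at most once, however often it appears.
    """
    synonyms = synonyms or {}
    tokens = tokenize(text)
    total = 0.0
    for keyword, weight in weighted_keywords.items():
        variants = [keyword] + synonyms.get(keyword, [])
        if any(contains_phrase(tokens, tokenize(v)) for v in variants):
            total += weight
    return total


def rank(texts, weighted_keywords, synonyms=None):
    """Return the texts ordered from best to worst match."""
    return sorted(texts,
                  key=lambda t: score(t, weighted_keywords, synonyms),
                  reverse=True)


# The weights and texts from the question.
keywords = {"Sun": 0.90, "National Park": 0.85, "Landmark": 0.60}
texts = [
    "Going-to-the-Sun Road is a scenic mountain road in the Rocky Mountains "
    "of the western United States, in Glacier National Park in Montana.",
    "Everybody has a little bit of the sun and moon in them. Everybody has "
    "a little bit of man, woman, and animal in them.",
    "A hybrid car is one that uses more than one means of propulsion - that "
    "means combining a petrol or diesel engine with an electric motor.",
]

for text in rank(texts, keywords):
    print(round(score(text, keywords), 2), text[:40])
```

With these inputs the first text matches both "Sun" and "National Park", the second matches only "Sun", and the third matches nothing, so the ranking follows the order the question asks for. Passing a `synonyms` dictionary (for example, mapping "Landmark" to a list of related words of your choosing) lets a text score on a keyword it never mentions literally.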

amirouche