
I'm trying to match an input text (e.g. the headline of a news article) against sets of keywords, such that the best-matching set can be selected.

Let's assume I have some sets of keywords:

[['democracy', 'votes', 'democrats'], ['health', 'corona', 'vaccine', 'pandemic'], ['security', 'police', 'demonstration']]

and as input the (hypothetical) headline: "New Pfizer vaccine might beat COVID-19 pandemic in the next few months". Obviously, it fits the second set of keywords well.

Exact word matching is one way to do it, but more complex situations might arise, in which it might make sense to use base forms of words (e.g. duck instead of ducks, or run instead of running) to enhance the algorithm. Now we're talking NLP already.
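
To make the base-form idea concrete, a naive lemma-overlap score could look like this (a rough sketch; I'm assuming spaCy's en_core_web_sm model here, and the helper name lemma_overlap is just for illustration):

    import spacy

    nlp = spacy.load("en_core_web_sm")  # a small model is enough for lemmatization

    def lemma_overlap(text, keywords):
        """Count how many keyword lemmas occur among the text's lemmas."""
        text_lemmas = {tok.lemma_.lower() for tok in nlp(text) if tok.is_alpha}
        keyword_lemmas = {tok.lemma_.lower() for kw in keywords for tok in nlp(kw)}
        return len(text_lemmas & keyword_lemmas)

    # The keyword set with the highest overlap would win.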

I experimented with spaCy word and document embeddings (example) to determine similarity between a headline and each set of keywords. Is it a good idea to calculate document similarity between a full sentence and a limited number of keywords? Are there other ways?
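
What I tried looks roughly like this (a sketch assuming spaCy's en_core_web_md model, which ships with word vectors; Doc.similarity compares the averaged word vectors of both sides):

    import spacy

    nlp = spacy.load("en_core_web_md")  # needs a model with word vectors

    keyword_sets = [
        ["democracy", "votes", "democrats"],
        ["health", "corona", "vaccine", "pandemic"],
        ["security", "police", "demonstration"],
    ]

    headline = nlp("New Pfizer vaccine might beat COVID-19 pandemic in the next few months")

    # Treat each keyword set as a tiny "document" and compare it to the headline.
    scores = [headline.similarity(nlp(" ".join(ks))) for ks in keyword_sets]
    best = scores.index(max(scores))
    print(keyword_sets[best], scores)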

Related: What NLP tools to use to match phrases having similar meaning or semantics

1 Answer

There is no single correct solution for such a task; you have to try what fits your problem!

Possible ways to solve it that I can think of:

  • Matching: either exact, or more elaborate variants such as lemmatization/stemming or Levenshtein distance (see the edit-distance sketch after this list).
  • Embedding similarity: I would guess word-level similarity outperforms document-to-keywords similarity, but again, just experiment with it.
  • Classification: your problem looks like a classic classification problem, with each set being one class (a classification sketch follows as well). If you don't have enough labeled training data, you could try active learning.
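
To make the first point concrete, here is a minimal edit-distance sketch in plain Python (the max_ratio threshold is an arbitrary value you would have to tune):

    def levenshtein(a: str, b: str) -> int:
        """Classic dynamic-programming edit distance."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def fuzzy_match(word: str, keyword: str, max_ratio: float = 0.25) -> bool:
        """Accept a pair if the edit distance is small relative to the word length."""
        limit = max_ratio * max(len(word), len(keyword))
        return levenshtein(word.lower(), keyword.lower()) <= limit

This would, for example, still match "vaccines" against the keyword "vaccine".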
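
And for the third point, a tiny scikit-learn sketch of plain supervised classification (the training headlines and labels here are made up; in practice you would need real labeled data, which is where active learning helps):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Hypothetical labeled headlines; label = index of the matching keyword set.
    train_texts = [
        "Senate passes voting rights bill",
        "New vaccine trial shows promising results",
        "Police break up downtown demonstration",
    ]
    train_labels = [0, 1, 2]

    clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    clf.fit(train_texts, train_labels)
    print(clf.predict(["New Pfizer vaccine might beat COVID-19 pandemic"]))
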
  • Thanks a lot for the suggestions! Can you elaborate a bit on your second point? Isn't a document embedding just the average of its word embeddings? How do you envision implementing this point? – jenzopr Nov 17 '20 at 12:56
  • Averaging embeddings is one possibility, but you could also loop through your sentence and, for each word, compare its word embedding with the keyword embeddings. The pair with the minimal distance is then chosen as the keyword set, for example. – chefhose Nov 17 '20 at 13:09
  • Uh, that's great! It can be nicely combined with lemmatizing and e.g. averaging over the top three/five/ten minimal distances to enhance robustness. I'll report on my findings! – jenzopr Nov 17 '20 at 14:03
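
Putting the two comments together, a word-level sketch could look like this (assuming en_core_web_md again; best_keyword_set is a hypothetical helper, and averaging the top_k most similar pairs follows the idea from the last comment):

    import spacy

    nlp = spacy.load("en_core_web_md")

    def best_keyword_set(headline, keyword_sets, top_k=3):
        """Score each set by the mean of its top_k most similar
        (headline word, keyword) pairs; maximal cosine similarity
        corresponds to minimal distance."""
        tokens = [t for t in nlp(headline) if t.is_alpha and t.has_vector]
        scores = []
        for ks in keyword_sets:
            kw_tokens = [nlp(kw)[0] for kw in ks]  # keywords are single words here
            sims = sorted((t.similarity(k) for t in tokens for k in kw_tokens),
                          reverse=True)
            scores.append(sum(sims[:top_k]) / top_k)
        return scores.index(max(scores)), scores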