I'm trying to match an input text (e.g. the headline of a news article) to sets of keywords, such that the best-matching set can be selected.
Let's assume I have some sets of keywords:
[['democracy', 'votes', 'democrats'], ['health', 'corona', 'vaccine', 'pandemic'], ['security', 'police', 'demonstration']]
and as input the (hypothetical) headline: "New Pfizer vaccine might beat COVID-19 pandemic in the next few months". Obviously, it fits the second set of keywords best.
Exact word matching is one way to do it, but more complex situations might arise, for which it might make sense to use base forms of words (e.g. duck instead of ducks, or run instead of running) to enhance the algorithm. Now we're talking NLP already.
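
To make it concrete, this is roughly the lemma-based variant I have in mind (just a minimal sketch; the model name en_core_web_sm, the stop-word filtering and the simple shared-lemma count are my own assumptions, not a finished algorithm):

```python
import spacy

# Small English pipeline is enough for lemmatization (assumption: it is installed).
nlp = spacy.load("en_core_web_sm")

keyword_sets = [
    ['democracy', 'votes', 'democrats'],
    ['health', 'corona', 'vaccine', 'pandemic'],
    ['security', 'police', 'demonstration'],
]

def lemma_overlap_score(headline, keywords):
    # Lemmatize headline and keywords, then count how many lemmas they share.
    headline_lemmas = {
        tok.lemma_.lower()
        for tok in nlp(headline)
        if not tok.is_stop and not tok.is_punct
    }
    keyword_lemmas = {tok.lemma_.lower() for tok in nlp(" ".join(keywords))}
    return len(headline_lemmas & keyword_lemmas)

headline = "New Pfizer vaccine might beat COVID-19 pandemic in the next few months"
scores = [lemma_overlap_score(headline, ks) for ks in keyword_sets]
best = max(range(len(keyword_sets)), key=lambda i: scores[i])
print(scores, "-> best set:", keyword_sets[best])
```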
I experimented with spaCy word and document embeddings (example) to determine similarity between a headline and each set of keywords. Is it a good idea to calculate document similarity between a full sentence and a limited number of keywords? Are there other ways?
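
The embedding experiment looks roughly like this (again only a sketch; it assumes a model that ships with word vectors, e.g. en_core_web_md, since Doc.similarity averages token vectors and gives meaningless scores with the vector-less small model):

```python
import spacy

# Medium English pipeline includes word vectors (assumption: it is installed).
nlp = spacy.load("en_core_web_md")

keyword_sets = [
    ['democracy', 'votes', 'democrats'],
    ['health', 'corona', 'vaccine', 'pandemic'],
    ['security', 'police', 'demonstration'],
]

headline_doc = nlp("New Pfizer vaccine might beat COVID-19 pandemic in the next few months")

# Compare the headline Doc against a Doc built from each keyword set.
similarities = [headline_doc.similarity(nlp(" ".join(ks))) for ks in keyword_sets]
print(similarities)  # I would expect the second set to score highest
```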
Related: What NLP tools to use to match phrases having similar meaning or semantics