
Imagine that I have a list of tokens:

tokens_to_search = [
  'fox.com',
  'australia',
  'messi',
  'ronaldo',
  'British premier league'
]

And I have a sentence which may include some words relevant to the tokens_to_search content:

sentence = 'Messi scored a goal in the premier league, watch on the Fox News'

The sentence can be split into tokens:

tokens_from_sentence = [
  'messi',
  ...,
  'premier',
  'league',
  ...,
  'fox',
  'news'
]

How can I detect the tokens from tokens_to_search within tokens_from_sentence using some fuzzy search? So the result would be

[
  'fox.com',
  'messi',
  'British premier league'
]

The simple approach is a nested loop that computes some token distance between every pair, but that's O(N*M). Maybe there's a smarter way to do this?
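For reference, here is a minimal sketch of that nested-loop baseline using the standard library's difflib.SequenceMatcher. The similarity threshold (0.6) and the "at least half of the words match" rule for multi-word tokens are assumptions chosen to reproduce the expected result above, not part of the question:

```python
from difflib import SequenceMatcher

tokens_to_search = [
    'fox.com',
    'australia',
    'messi',
    'ronaldo',
    'British premier league',
]

tokens_from_sentence = [
    'messi', 'scored', 'a', 'goal', 'in', 'the',
    'premier', 'league', 'watch', 'on', 'fox', 'news',
]

def fuzzy_find(search_tokens, sentence_tokens, threshold=0.6):
    """Return the search tokens whose words fuzzy-match sentence tokens.

    A multi-word search token counts as found when at least half of
    its words match some sentence token above the similarity threshold.
    """
    found = []
    for token in search_tokens:
        words = token.lower().split()
        hits = sum(
            1 for w in words
            if any(SequenceMatcher(None, w, s).ratio() >= threshold
                   for s in sentence_tokens)
        )
        if hits >= len(words) / 2:
            found.append(token)
    return found

print(fuzzy_find(tokens_to_search, tokens_from_sentence))
# → ['fox.com', 'messi', 'British premier league']
```

This is exactly the O(N*M) approach in question: every search word is compared against every sentence token.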

Thanks in advance!

Sergey Potekhin
  • does it have to be a fuzzy search? or are you just trying to improve performance? – Anurag Wagh Jun 05 '20 at 10:35
  • @AnuragWagh yeah, it has to be. – Sergey Potekhin Jun 05 '20 at 16:05
  • Look into [that question with my answer](https://stackoverflow.com/a/58791875/140837). It requires indexing `tokens_to_search`, which takes some time, but might pay off in the long run since queries are very fast (100 times faster than fuzzywuzzy, with better results). Also look into https://stackoverflow.com/q/52046394/140837 to learn how to link words or groups of words into a Knowledge Base. – amirouche Jun 30 '20 at 07:42
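To illustrate the general indexing idea the last comment hints at (this is a generic character-trigram inverted index sketch, not the linked answer's actual implementation): build the index over tokens_to_search once, then each sentence word only touches the candidates that share a trigram with it, instead of being compared against every search token.

```python
from collections import defaultdict

tokens_to_search = [
    'fox.com',
    'australia',
    'messi',
    'ronaldo',
    'British premier league',
]

def trigrams(word):
    # Pad the word so that short words still produce trigrams.
    w = f"  {word} "
    return {w[i:i + 3] for i in range(len(w) - 2)}

# Build the inverted index once: trigram -> search tokens containing it.
index = defaultdict(set)
for token in tokens_to_search:
    for word in token.lower().split():
        for g in trigrams(word):
            index[g].add(token)

def candidates(sentence_word):
    """Return the search tokens sharing at least one trigram with the word."""
    hits = set()
    for g in trigrams(sentence_word.lower()):
        hits |= index[g]
    return hits
```

The candidate sets are small, so an exact distance check (e.g. SequenceMatcher) only needs to run on them, avoiding the full N*M comparison.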
