
I have two lists of words, like so:

LIST1 = ['whisky', 'spirits', 'liqueur']
LIST2 = ['bottle', 'barrel', 'can', 'cup']

I also have a string of text (call the string object TEXT) that I would like to search. The end result should be a count of the number of times each word in LIST1 appears in TEXT within a given distance (e.g., within 10 words) of any of the words in LIST2. I can imagine complicated ways of accomplishing this by iterating regular expression searches over both lists, but my actual LIST1 and LIST2 are quite long, and the text I am searching is large, so iterating isn't a good option. I was hopeful there might be a purpose-built tool when I found NLTK, but unless I am missing something, it has no functionality of the type I need. Is there an easy way to accomplish this?
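
For concreteness, the kind of direct check I have in mind looks something like the sketch below (assuming simple whitespace tokenization and a symmetric 10-word window on either side of each match):

words = TEXT.lower().split()                    # naive whitespace tokenization (assumption)
counts = {w: 0 for w in LIST1}
for i, word in enumerate(words):
    if word in LIST1:                           # linear scan of LIST1 for every token
        window = words[max(0, i - 10): i + 11]  # 10 words on either side
        if any(w2 in window for w2 in LIST2):   # another scan of LIST2 per hit
            counts[word] += 1

This works on small examples, but the repeated membership tests over long lists and a large text are exactly the iteration I would like to avoid.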

Note: I can't tell for sure, but I think my problem may be similar to the one discussed in this unanswered post.

hallque
  • If the condition is that any word in LIST1 is within 10 words of any word in LIST2, then I'm afraid there is no way to avoid iterating. – Alexander Apr 05 '22 at 03:27
  • Probably even `NLTK` has to do it with some iteration – but it may do it in C/C++ code, so it may work faster. If possible, I would split the text into a list of words and work with indexes in that list. – furas Apr 05 '22 at 03:48
  • If the text is long, but the number of occurrences of words of interest is small, then it's probably better to scan the text just once to extract the indices of every word of interest. Then compare the extracted indices to find out if two words occur within the given distance. – Stef Apr 05 '22 at 08:55
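
A minimal sketch of the single-pass, index-based approach suggested in the comments above (whitespace tokenization, lowercasing, and the name count_near are assumptions; TEXT, LIST1, and LIST2 are as in the question):

from collections import Counter, defaultdict

def count_near(text, list1, list2, distance=10):
    # One pass over the text: record the positions of the words of interest.
    words = text.lower().split()
    set1, set2 = set(list1), set(list2)
    positions1 = defaultdict(list)   # LIST1 word -> indices where it occurs
    positions2 = []                  # indices of any LIST2 word
    for i, w in enumerate(words):
        if w in set1:
            positions1[w].append(i)
        if w in set2:
            positions2.append(i)
    # Compare the (usually much smaller) index lists instead of rescanning the text.
    counts = Counter()
    for w, idxs in positions1.items():
        for i in idxs:
            if any(abs(i - j) <= distance for j in positions2):
                counts[w] += 1
    return counts

Called as count_near(TEXT, LIST1, LIST2, distance=10), this returns a Counter mapping each LIST1 word to the number of its occurrences that fall within 10 words of some LIST2 word; since positions2 is built in ascending order, the inner any could be replaced with a bisect lookup if the index lists grow long.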

0 Answers