I have a large number of text documents on the one hand and a huge list of keywords (strings) on the other. I want to find out which of these keywords occur in each document.
At the moment I'm using a monstrous auto-generated regex:
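keywords = %w(Key1 Key2 Key3)  # %w takes whitespace-separated words, no commas
# Escape each keyword so regex metacharacters are matched literally
regx = Regexp.new('\b(' + keywords.map { |k| Regexp.escape(k) }.join('|') + ')\b', Regexp::IGNORECASE)
documents.each do |d|
  d.scan(regx)
end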
This worked great for a list of a few hundred keywords, but now I'm using about 50,000 keywords and it has become far too slow.
Is there a better way to do this in Ruby?
EDIT:
- The documents are typical news articles, such as reports on recent sports events that you can find via Google News. In my test set each article contains about 1000 words.
- The keywords can be single words but can also be phrases containing multiple words, like 'Franz Beckenbauer' or 'Russel Wilson'.
- I'm interested only in complete matches - so searching for 'diction' should only match 'diction', not 'dictionary'
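To make these matching rules concrete, here is a minimal sketch of the behavior I expect, using the same regex approach on made-up data. The helper name `matched_keywords` and the sample sentence are hypothetical and only there for illustration; this is the slow approach, not a proposed solution.

```ruby
# Hypothetical sample data, only to illustrate the matching rules above.
keywords = ['diction', 'Franz Beckenbauer', 'Russel Wilson']

# One case-insensitive alternation with word boundaries around it.
# A non-capturing group is used so that String#scan returns plain strings.
pattern = Regexp.new(
  '\b(?:' + keywords.map { |k| Regexp.escape(k) }.join('|') + ')\b',
  Regexp::IGNORECASE
)

# Returns the distinct keywords found in a document, lowercased.
def matched_keywords(document, pattern)
  document.scan(pattern).map(&:downcase).uniq
end

doc = 'Franz Beckenbauer opened a dictionary; his diction was flawless.'
p matched_keywords(doc, pattern)
# => ["franz beckenbauer", "diction"]
```

Note that 'dictionary' does not produce a match for 'diction', while the multi-word phrase 'Franz Beckenbauer' is matched as a whole.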