I have a list of keywords, for example:
keywords = ['airbnb.com', 'booking', 'deliveroo.uk - UK', ...]
My goal is to define the token_pattern parameter of CountVectorizer by concatenating all keywords.
The idea is this:
token_pattern = '|'.join([pattern_keyword_1, pattern_keyword_2, ...])
What matters to me is that the pattern matches only exact occurrences of a keyword in the text, not occurrences inside a longer string.
For example, if I have 'def.com' in the keywords, I DON'T want it to match 'abcdef.com'.
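A minimal sketch of the kind of pattern described above, under some assumptions: "exact occurrence" is taken to mean that the keyword is delimited by whitespace or by the start/end of the text, the helper name build_token_pattern is hypothetical, and 'def.com' is added to the sample list purely for illustration. The pattern is checked here with re.findall only.

```python
import re

# Sample keywords; 'def.com' is added only to illustrate the substring case.
keywords = ['airbnb.com', 'booking', 'deliveroo.uk - UK', 'def.com']

def build_token_pattern(keywords):
    """Hypothetical helper: join escaped keywords into one alternation."""
    # Try longer keywords first so a short keyword cannot shadow a longer one.
    ordered = sorted(keywords, key=len, reverse=True)
    # re.escape makes '.', '-' and spaces match literally.
    escaped = [re.escape(k) for k in ordered]
    # (?<!\S) and (?!\S) require whitespace or a string boundary on each side
    # (assumption: this is what "exact occurrence" means here).
    # The group is non-capturing, so findall returns the whole match.
    return r'(?<!\S)(?:' + '|'.join(escaped) + r')(?!\S)'

token_pattern = build_token_pattern(keywords)

text = 'abcdef.com def.com booking deliveroo.uk - UK airbnb.com'
matches = re.findall(token_pattern, text)
print(matches)
```

The resulting string could then be passed as CountVectorizer(token_pattern=token_pattern, lowercase=False). The lowercase=False part matters for a keyword such as 'deliveroo.uk - UK', because the text is lowercased by default before tokenization. With these whitespace boundaries, a keyword followed directly by punctuation (for example 'def.com,') would not match, so the lookarounds may need adjusting.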
Is this possible?
Thanks in advance.