- I have a list of strings containing 50 million search queries (each query is 1 to 500+ words long).
- I also have a list of strings containing 500 words and phrases. I need to return the indices of the search queries (1) that contain any of the words or phrases (2).
The goal is to keep only the queries related to a certain topic (movies) and then cluster the filtered queries with NLP (stemming -> tf-idf -> PCA -> k-means; a sketch of that pipeline is at the end of the post).
I tried filtering the queries with nested loops, but it takes more than 10 hours to finish:
filtered = []
with open('search_logs.txt', 'r', encoding='utf-8') as f:
    for i, line in enumerate(f):
        query, timestamp = line.strip().split('\t')
        for word in key_words:
            if word in query:
                filtered.append(i)
                break  # stop after the first match so each index is appended only once
I looked into solutions that use a single regex (word1|word2|...|wordN), but the problem is that I cannot concatenate the queries into one large string, since I need to know the index of each matching query.
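For reference, this is roughly what the regex idea looks like when the pattern is applied per line instead, so the indices are preserved (the pattern is built from the key_words list shown in the update below):

import re

# One compiled alternation over all keywords; re.escape guards
# against regex metacharacters in the phrases.
pattern = re.compile('|'.join(re.escape(word) for word in key_words))

filtered = []
with open('search_logs.txt', 'r', encoding='utf-8') as f:
    for i, line in enumerate(f):
        query, timestamp = line.strip().split('\t')
        if pattern.search(query):
            filtered.append(i)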
UPDATE: examples of the logs and keywords
search_logs.txt
'query timestamp\n'
'the dark knight 2019-02-17 19:05:12\n'
'how to do a barrel roll 2019-02-17 19:05:13\n'
'watch movies 2019-02-17 19:05:13\n'
'porn 2019-02-17 19:05:13\n'
'news 2019-02-17 19:05:14\n'
'rami malek 2019-02-17 19:05:14\n'
'Traceback (most recent call last): File "t.py" 2019-02-17 19:05:15\n'
.......... # millions of other search queries
key_words = [
    'movie',
    'movies',
    'cinema',
    'oscar',
    'oscars',
    'george lucas',
    'ben affleck',
    'netflix',
    ....  # hundreds of other words and phrases
]
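And for context, a minimal sketch of the downstream clustering pipeline I have in mind; the library choices (NLTK's SnowballStemmer for stemming, scikit-learn, and TruncatedSVD standing in for PCA because the tf-idf matrix is sparse) are just my assumptions, not a fixed requirement:

from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

stemmer = SnowballStemmer('english')

def stem_query(query):
    # Stem each token before vectorizing.
    return ' '.join(stemmer.stem(token) for token in query.split())

# movie_queries would be the queries kept by the filtering step above;
# a tiny hard-coded sample here just to make the sketch runnable.
movie_queries = ['the dark knight', 'watch movies', 'rami malek']
stemmed = [stem_query(q) for q in movie_queries]

tfidf = TfidfVectorizer().fit_transform(stemmed)              # tf-idf
reduced = TruncatedSVD(n_components=2).fit_transform(tfidf)   # "PCA" on sparse input
labels = KMeans(n_clusters=2, n_init=10).fit_predict(reduced) # k-means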