6
  1. I have a list of 50 million search query strings (each query is 1 to 500+ words).
  2. I also have a list of 500 words and phrases. I need to return the indices of the search queries in (1) that contain any word or phrase from (2).

The goal is to keep only the queries related to a certain topic (movies) and then use NLP to cluster the filtered queries (stemming -> tf-idf -> PCA -> k-means).
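Roughly, the clustering step would look something like this (just a sketch, not my actual code; `filtered_queries` stands for the output of the filtering step, TruncatedSVD is used as a sparse-friendly stand-in for PCA, and the component/cluster counts are placeholders):

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

stemmer = PorterStemmer()

# stem each filtered query before vectorizing
stemmed = [' '.join(stemmer.stem(w) for w in q.split()) for q in filtered_queries]

# tf-idf -> dimensionality reduction -> k-means
tfidf = TfidfVectorizer().fit_transform(stemmed)
reduced = TruncatedSVD(n_components=100).fit_transform(tfidf)
labels = KMeans(n_clusters=50).fit_predict(reduced)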

I tried to filter queries using nested loops, but it would take more than 10 hours to finish.

filtered = []
with open('search_logs.txt', 'r', encoding='utf-8') as f:
    for i, line in enumerate(f):
        query, timestamp = line.strip().split('\t')
        # check every keyword against every query
        for word in key_words:
            if word in query:
                filtered.append(i)
                break  # stop after the first match so each index is added only once

I looked into solutions that use a regex alternation (word1|word2|...|wordN), but the problem is that I cannot combine the queries into one large string, since I need to know which individual queries are irrelevant so I can filter them out.
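The per-query variant of that approach, with one compiled alternation checked against each line, would keep the indices (rough sketch below, using the same `key_words` list as above), but I'm not sure it would be fast enough:

import re

# one compiled alternation over all keywords, checked against each query separately
pattern = re.compile('|'.join(re.escape(w) for w in key_words))

filtered = []
with open('search_logs.txt', 'r', encoding='utf-8') as f:
    for i, line in enumerate(f):
        query, timestamp = line.strip().split('\t')
        if pattern.search(query):
            filtered.append(i)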

UPDATE: examples of logs and keywords

search_logs.txt
'query  timestamp\n'
'the dark knight    2019-02-17 19:05:12\n'
'how to do a barrel roll    2019-02-17 19:05:13\n'
'watch movies   2019-02-17 19:05:13\n'
'porn   2019-02-17 19:05:13\n'
'news   2019-02-17 19:05:14\n'
'rami malek 2019-02-17 19:05:14\n'
'Traceback (most recent call last): File "t.py" 2019-02-17 19:05:15\n'
.......... # millions of other search queries
key_words = [
    'movie',
    'movies',
    'cinema',
    'oscar',
    'oscars',
    'george lucas',
    'ben affleck',
    'netflix',
    .... # hundreds of other words and phrases
]
Superbman
  • With this much data, you should expect a long running time. – Tim Biegeleisen Apr 21 '19 at 15:08
  • True, but I suspect there are more efficient ways to do this – Superbman Apr 21 '19 at 15:12
  • You could look into multiprocessing to run the algorithm in parallel on all your available cores. Python is single-threaded and generally slow, so I'd prefer to write this sort of thing in C as a multithreaded application. Regex is probably not a performance-oriented solution, either. – ggorlen Apr 21 '19 at 15:15
  • Have you seen [this thread](https://stackoverflow.com/a/42789508/3832970)? With a regex trie, you may create a compact regex that will search exactly for your strings. – Wiktor Stribiżew Apr 21 '19 at 16:05
  • Nope, I'll give it a try. – Superbman Apr 21 '19 at 16:16
  • The regex trie can be made even faster (at the cost of O(n^2) preprocessing) by converting it to an actual FSA, minimizing it, and converting it back, which gives an even smaller regex. But the minimization operation is very expensive. There is a great post about implementing this in Rust: https://blog.burntsushi.net/transducers/ – Tommaso Fontana Feb 03 '20 at 16:55
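A rough sketch of the regex-trie idea from the comments above (a much-simplified version of the approach in the linked thread, not the linked implementation itself): build a character trie from the keywords and serialize it as a compact pattern.

import re

def trie_regex(words):
    # build a character trie from the keywords; '' marks the end of a keyword
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[''] = True

    # serialize the trie into a compact pattern, e.g. 'movie(?:s)?' instead of 'movie|movies'
    def to_pattern(node):
        branches = [re.escape(ch) + to_pattern(child)
                    for ch, child in sorted(node.items()) if ch != '']
        if not branches:
            return ''
        pattern = branches[0] if len(branches) == 1 else '(?:' + '|'.join(branches) + ')'
        if '' in node:  # a keyword ends here, so the remainder is optional
            pattern = '(?:' + pattern + ')?'
        return pattern

    return re.compile(to_pattern(trie))

keyword_pattern = trie_regex(key_words)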

2 Answers

1

Set comparison - Jaccard Similarity

Jaccard similarity is a metric for comparing pairs of word sets: https://www.statology.org/jaccard-similarity/

I'll suggest three methods:

  1. Using set comparison: save the keyword list as a set, convert each query string into a set of words on the fly, and intersect it with the keyword set

e.g.:

# index the keyword list as a set for fast membership tests
s = set(key_words)

# pairwise comparison: intersect each query's word set with the keyword set
idx_list = []
for i, query in enumerate(search_arr):
    if set(query.split(' ')).intersection(s):
        idx_list.append(i)

Something like this gives you the ability to search, but the pairwise comparison here still takes at least O(N) over all the queries.

  2. A better way would be to use reverse indexing (an inverted index): take all the unique words in the search queries, build a temporary index mapping each word to the queries containing it, and then look up the keywords in that index to get the list of matching indices

e.g.:

# search query indexing using a hashmap: word -> set of indices of queries containing that word
hmap = dict()
for i, query in enumerate(search_list):
    for word in query.split(' '):
        if word not in hmap:
            hmap[word] = {i}
        else:
            hmap[word].add(i)

This builds your search index, which can then be queried with the keywords as a reverse-index (inverted-index) lookup.
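Querying that index with the keywords could then look like the sketch below (note that multi-word phrases such as 'george lucas' are not covered by a single-word lookup and would need extra handling, e.g. intersecting the sets of their individual words):

# collect every query index that contains at least one keyword
matched = set()
for word in key_words:
    matched |= hmap.get(word, set())

idx_list = sorted(matched)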

  3. If even this is not efficient enough, try locality-sensitive hashing (LSH):

https://towardsdatascience.com/understanding-locality-sensitive-hashing-49f6d1f6134

0

I would suggest FlashText, which was developed to be very efficient for exactly this kind of task. It will work as long as the keywords you are searching for are plain strings (as opposed to complicated regexes).
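A minimal sketch of what that could look like, assuming the third-party flashtext package and the key_words list and log format from the question:

from flashtext import KeywordProcessor

# build the keyword processor once; lookup time does not grow with the number of keywords
kp = KeywordProcessor()
kp.add_keywords_from_list(key_words)

filtered = []
with open('search_logs.txt', 'r', encoding='utf-8') as f:
    for i, line in enumerate(f):
        query, timestamp = line.strip().split('\t')
        if kp.extract_keywords(query):
            filtered.append(i)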

aab