This answer addresses improving performance without using concurrency.

The way you structured your search, you are looking for each of 13 million unique things in every sentence. You said it takes 3-5 minutes per sentence and that the word lengths in concepts range from one to ten.
I think you can improve the search time by making a set of concepts (either initially when constructed or from your list), then splitting each sentence into strings of one to ten (consecutive) words and testing for membership in the set.
Example of a sentence split into 4-word strings:
'data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems'
# becomes
[('data', 'mining', 'is', 'the'),
 ('mining', 'is', 'the', 'process'),
 ('is', 'the', 'process', 'of'),
 ('the', 'process', 'of', 'discovering'),
 ('process', 'of', 'discovering', 'patterns'),
 ('of', 'discovering', 'patterns', 'in'),
 ('discovering', 'patterns', 'in', 'large'),
 ('patterns', 'in', 'large', 'data'),
 ('in', 'large', 'data', 'sets'),
 ('large', 'data', 'sets', 'involving'),
 ('data', 'sets', 'involving', 'methods'),
 ('sets', 'involving', 'methods', 'at'),
 ('involving', 'methods', 'at', 'the'),
 ('methods', 'at', 'the', 'intersection'),
 ('at', 'the', 'intersection', 'of'),
 ('the', 'intersection', 'of', 'machine'),
 ('intersection', 'of', 'machine', 'learning'),
 ('of', 'machine', 'learning', 'statistics'),
 ('machine', 'learning', 'statistics', 'and'),
 ('learning', 'statistics', 'and', 'database'),
 ('statistics', 'and', 'database', 'systems')]
Process:

concepts = set(concepts)
words = sentence.split()
found = []

# one word
for meme in words:
    if meme in concepts:
        found.append(meme)

# two words
for meme in zip(words, words[1:]):
    if ' '.join(meme) in concepts:
        found.append(' '.join(meme))

# three words
for meme in zip(words, words[1:], words[2:]):
    if ' '.join(meme) in concepts:
        found.append(' '.join(meme))
Adapting an itertools recipe (pairwise), you can automate that process of making n-word strings from a sentence:
from itertools import tee

def nwise(iterable, n=2):
    "s -> (s0,s1), (s1,s2), (s2, s3), ... for n=2"
    iterables = tee(iterable, n)
    # advance each iterable to the appropriate starting point
    for i, thing in enumerate(iterables[1:], 1):
        for _ in range(i):
            next(thing, None)
    return zip(*iterables)
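For example, applied to a short list of words it yields the overlapping windows directly (output shown as a comment):

words = 'data mining is the process'.split()
list(nwise(words, 2))
# [('data', 'mining'), ('mining', 'is'), ('is', 'the'), ('the', 'process')]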
Testing each sentence looks like this:

words = sentence.strip().split()
found = []
for n in range(1, 11):
    for meme in nwise(words, n):
        if ' '.join(meme) in concepts:
            found.append(' '.join(meme))
I made a set of 13e6 random strings with 20 characters each to approximate concepts.

import random, string
data = set(''.join(random.choice(string.printable) for _ in range(20)) for _ in range(13000000))
Testing a four or forty character string for membership in data consistently takes about 60 nanoseconds. A one hundred word sentence has 955 one-to-ten word strings, so searching that sentence should take ~60 microseconds.
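As a rough sketch of how that lookup timing can be measured (the probe string is illustrative and absolute numbers will vary by machine):

import timeit

probe = 'x' * 40  # 40 characters, so it cannot be in data (which holds 20-character strings)
n = 1_000_000
t = timeit.timeit('probe in data', globals=globals(), number=n)
print(f'{t / n * 1e9:.0f} ns per membership test')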
The first sentence from your example, 'data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems', has 195 possible concepts (one to ten word strings). Timing for the following two functions is about the same: about 140 microseconds for f and 150 microseconds for g:
def f(sentence, data=data, nwise=nwise):
    '''iterate over memes in sentence and see if they are in data'''
    sentence = sentence.strip().split()
    found = []
    for n in range(1, 11):
        for meme in nwise(sentence, n):
            meme = ' '.join(meme)
            if meme in data:
                found.append(meme)
    return found

def g(sentence, data=data, nwise=nwise):
    '''make a set of the memes in sentence then find its intersection with data'''
    sentence = sentence.strip().split()
    test_strings = set(' '.join(meme) for n in range(1, 11) for meme in nwise(sentence, n))
    found = test_strings.intersection(data)
    return found
These are just approximations since I'm not using your actual data, but it should speed things up quite a bit.
After testing with your example data I found that g won't work if a concept appears twice in a sentence.
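That's because a set collapses duplicates, so the intersection can report each concept at most once. A minimal illustration (the sentence is made up):

sentence = 'extract information then transform the information'
test_strings = set(sentence.split())
print(test_strings.intersection({'information'}))
# {'information'} - only one hit, even though the word appears twice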
So here it is all together, with the concepts listed in the order they are found in each sentence. The new version of f will take longer, but the added time should be relatively small. If possible, would you post a comment letting me know how much longer it is than the original? (I'm curious.)
from itertools import tee

sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems',
             'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
             'data mining is the analysis step of the knowledge discovery in databases process or kdd']

concepts = ['data mining', 'database systems', 'databases process',
            'interdisciplinary subfield', 'information', 'knowledge discovery',
            'methods', 'machine learning', 'patterns', 'process']

concepts = set(concepts)

def nwise(iterable, n=2):
    "s -> (s0,s1), (s1,s2), (s2, s3), ... for n=2"
    iterables = tee(iterable, n)
    # advance each iterable to the appropriate starting point
    for i, thing in enumerate(iterables[1:], 1):
        for _ in range(i):
            next(thing, None)
    return zip(*iterables)

def f(sentence, concepts=concepts, nwise=nwise):
    '''iterate over memes in sentence and see if they are in concepts'''
    indices = set()
    #print(sentence)
    words = sentence.strip().split()
    for n in range(1, 11):
        for meme in nwise(words, n):
            meme = ' '.join(meme)
            if meme in concepts:
                start = sentence.find(meme)
                end = len(meme) + start
                while (start, end) in indices:
                    #print(f'{meme} already found at character:{start} - looking for another one...')
                    start = sentence.find(meme, end)
                    end = len(meme) + start
                indices.add((start, end))
    return [sentence[start:end] for (start, end) in sorted(indices)]

###########
results = []
for sentence in sentences:
    results.append(f(sentence))
    #print(f'{sentence}\n\t{results[-1]}')
In [20]: results
Out[20]:
[['data mining', 'process', 'patterns', 'methods', 'machine learning', 'database systems'],
['data mining', 'interdisciplinary subfield', 'information', 'information'],
['data mining', 'knowledge discovery', 'databases process', 'process']]