I have a large set (say 30 million) of concept strings (at most 13 words each) in a database. Given an input string (at most, say, 3 sentences), I would like to find all the concepts from the database that appear in the input string.
I am using Python for this. I load all the concepts from the database into a list, loop through the list, and check whether each concept appears in the input string. Because the search is essentially sequential, the process takes a long time, and I will have to repeat it for hundreds of input strings.
To prune some iterations, I tokenize the input string and load only the concepts that contain at least one of the tokens and whose length is less than or equal to the length of the input string. This requires an SQL query to load the shortlisted concepts into the list. Even so, the list might still contain 20 million concepts, so the process is not much faster.
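Roughly, the shortlisting step looks like this (the `concepts(phrase)` table, the stop-word set, and the use of an in-memory sqlite3 database are all just for illustration, not my actual schema):

```python
import sqlite3

# Toy stand-in for the real database: a table `concepts` with a text column `phrase`.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE concepts (phrase TEXT)")
conn.executemany("INSERT INTO concepts VALUES (?)",
                 [("cow",), ("domestic animal",), ("two ears",)])

input_string = "The cow is a domestic animal."

# Tokenize the input and drop stop words (tiny made-up stop list).
stop_words = {"the", "is", "a", "it", "has"}
tokens = [t.strip(".,").lower() for t in input_string.split()
          if t.strip(".,").lower() not in stop_words]

# Shortlist: concepts containing at least one token,
# and no longer than the input string itself.
clauses = " OR ".join("phrase LIKE ?" for _ in tokens)
params = ["%{}%".format(t) for t in tokens] + [len(input_string)]
query = ("SELECT phrase FROM concepts WHERE (" + clauses + ")"
         " AND length(phrase) <= ?")
shortlist = [row[0] for row in conn.execute(query, params)]
```

The shortlist then becomes the `concepts` list that the loop below iterates over.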
Any idea how this process could be made more efficient?
For better illustration, here is a small Python example:
inputString = "The cow is a domestic animal. It has four legs, one tail, two eyes"
#load concept list from the database that have any of the words in input string (after removing stop words). Assume the list is as follows.
concepts = ["cow", "domestic animal", "domestic bird", "domestic cat", "domestic dog", "one eye", "two eyes", "two legs", "four legs", "two ears"]
for c in concepts:
    if c in inputString:
        print('found ' + c + ' in ' + inputString)
It would be great if you can give me some suggestions to make it more efficient.