
I have a large set (say 30 million) of concept strings (maximum 13 words per string) in a database. Given an input string (maybe three sentences at most), I would like to find all the concepts from the database that appear in the input string.

I am using Python for this purpose. I load all the concepts from the database into a list, loop through that list, and check whether each concept appears in the input string. Since the search is essentially sequential, the process takes a long time, and I will have to do it for hundreds of input strings.

To prune some iterations, I tokenize the input string and load only the concepts that contain at least one of the tokens and whose length is less than or equal to the length of the input string. This requires an SQL query to load the shortlisted concepts into the list. Still, the list might contain 20 million concepts, so the process is not that fast.
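For illustration, a rough sketch of such a pruning query, assuming a hypothetical concepts(text, num_words) table in SQLite; the real schema and database driver may differ:

import sqlite3

def load_candidate_concepts(db_path, tokens, max_words):
    # Hypothetical schema: concepts(text TEXT, num_words INTEGER)
    conn = sqlite3.connect(db_path)
    like_clauses = " OR ".join(["text LIKE ?"] * len(tokens))
    query = ("SELECT text FROM concepts "
             "WHERE num_words <= ? AND (" + like_clauses + ")")
    params = [max_words] + ['%' + t + '%' for t in tokens]
    rows = conn.execute(query, params).fetchall()
    conn.close()
    return [row[0] for row in rows]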

Any idea how this process could be made more efficient?

For better visualization, here is a small Python example:

inputString = "The cow is a domestic animal. It has four legs, one tail, two eyes"

# Load the concepts from the database that contain any of the words in the
# input string (after removing stop words). Assume the list is as follows.
concepts = ["cow", "domestic animal", "domestic bird", "domestic cat", "domestic dog", "one eye", "two eyes", "two legs", "four legs", "two ears"]

for c in concepts:
    if c in inputString:
        print('found ' + c + ' in ' + inputString)

It would be great if you could give me some suggestions to make this more efficient.

  • 1
    This probably isn't the answer you're looking for, but `print` statements are very resource-heavy. Remove your print statement, save it to a list and then print the list at the end. It will be significantly faster. – emsimpson92 Jun 28 '18 at 22:58
  • Thanks for the input. Much appreciated. I just showed it for example purpose though, but I will keep in mind. – user6725114 Jun 28 '18 at 23:04
  • Not a problem. I had a similar problem where I was working with a large dataset and when I removed the print statement it executed about 5-10x faster. – emsimpson92 Jun 28 '18 at 23:06
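Following the suggestion in the comments above, a minimal sketch of collecting the matches in a list inside the loop and printing only once at the end:

found = []
for c in concepts:
    if c in inputString:
        found.append(c)  # collect matches instead of printing inside the loop
print('found in input: ' + ', '.join(found))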

1 Answer


You should make use of sets, whose membership lookups are much faster than scanning a list or doing full-text substring search.

Put these concepts into a dict of sets, indexed by the number of words. Then split inputString into a list of words, and slide a rolling window of that many words over the list, testing whether the joined window exists in the set indexed by the same number of words.

So given the following initialization:

from collections import defaultdict
import re
inputString = "The cow is a domestic animal. It has four legs, one tail, two eyes"
concepts = ["cow", "domestic animal", "domestic bird", "forever and ever", "practice makes perfect", "i will be back"]

We break down concepts into a dict of sets, keyed by the number of words in the concepts that each set contains:

concept_sets = defaultdict(set)
for concept in concepts:
    concept_sets[len(concept.split())].add(concept)

So that concept_sets becomes:

{1: {'cow'}, 2: {'domestic bird', 'domestic animal'}, 3: {'practice makes perfect', 'forever and ever'}, 4: {'i will be back'}}

Then we turn inputString into a list of lowercase words so that the match is case-insensitive. Note that you may want to refine the regex here so that it treats certain other characters as part of a "word".

input_words = list(map(str.lower, re.findall(r'[a-z]+', inputString, re.IGNORECASE)))

Finally, we loop through each concept set in concept_sets along with its number of words, slide a window of that many words over the input word list, and test whether the joined window exists in the set.

for num_words, concept_set in concept_sets.items():
    for i in range(len(input_words) - num_words + 1):
        words = ' '.join(input_words[i: i + num_words])
        if words in concept_set:
            print("found '%s' in '%s'" % (words, inputString))

This outputs:

found 'cow' in 'The cow is a domestic animal. It has four legs, one tail, two eyes'
found 'domestic animal' in 'The cow is a domestic animal. It has four legs, one tail, two eyes'
blhsing
  • Thanks for your detailed answer. Actually, my main scenario is a bit different and as part of it I also remove all the spaces so that the algorithm can capture if two words are combined due to typo. But your idea seems very interesting and I will try to map my method to it and will see the performance. Thanks again. – user6725114 Jul 02 '18 at 19:20
  • You're welcome. Please mark the answer as accepted if you do find it a good solution in your testing. I'd also like to know how many times faster this solution is than your original one. Cheers. – blhsing Jul 02 '18 at 19:22
  • 1
    I have now implemented the solution with a set of dictionaries and removed the SQL queries by preprocessing into dictionaries and saving them with cPickle. Later I load the dictionaries into main memory from cPickle. Used some pruning techniques to reduce dictionary access. The initial loading takes time, but searching is significantly faster. Next, I will work on reducing the initial loading time from cPickle. Saw some ways to deal with the loading (https://stackoverflow.com/questions/26860051/how-to-reduce-the-time-taken-to-load-a-pickle-file-in-python), but it is out of scope now. Thanks! – user6725114 Jul 09 '18 at 19:31
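A minimal sketch of the preprocessing workflow described in that last comment, assuming the concept_sets dict from the answer above (pickle here stands in for cPickle, which is its Python 2 name):

import pickle

# One-time preprocessing: build concept_sets (as in the answer) and persist it to disk.
with open('concept_sets.pkl', 'wb') as f:
    pickle.dump(dict(concept_sets), f, protocol=pickle.HIGHEST_PROTOCOL)

# Later runs: load the prebuilt dict into memory instead of querying the database.
with open('concept_sets.pkl', 'rb') as f:
    concept_sets = pickle.load(f)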