
I'm trying to search through a list of strings (sentences) and check whether or not they contain a specific set of substrings. To do this I'm using Python's `any` function:

import time

sentences = ["I am in London tonight",
             "I am in San Fran tomorrow",
             "I am in Paris next Wednesday"]

# Imagine the following lists to contain 1000's of strings
listOfPlaces = ["london", "paris", "san fran"]
listOfTimePhrases = ["tonight", "tomorrow", "week", "monday", "wednesday", "month"]

start = time.time()

sntceIdxofPlaces = [pos for pos, sent in enumerate(sentences) if any(x in sent for x in listOfPlaces)]
sntceIdxofTimes = [pos for pos, sent in enumerate(sentences) if any(x in sent for x in listOfTimePhrases)]

end = time.time()

print(end-start)

If you imagine that my lists are extremely large, what I find is that the time elapsed on the two `any` statements is considerably long: I get roughly 2 seconds for two such `any` queries. Do you have any idea why it is taking so long, and do you know of any way to make the code faster?

Thanks

kPow989

6 Answers


Don't enumerate your sentences twice. You could do both checks with one loop over your sentences.

sntceIdxofPlaces = []
sntceIdxofTimes = []
for pos, sent in enumerate(sentences):
    if any(x in sent for x in listOfPlaces):
        sntceIdxofPlaces.append(pos)
    if any(x in sent for x in listOfTimePhrases):
        sntceIdxofTimes.append(pos)
joebeeson
  • This answer combined with using sets for your lists of data is the way to go. – Morgan Thrapp Dec 07 '16 at 17:36
  • @joeb Thanks for that. I have just tested the single for loop with the set advice everyone else kindly gave. I'm not sure if I'm being silly but there wasn't much gain. – kPow989 Dec 07 '16 at 17:43
  • The gain from this is negligible according to my timing tests (`foo` versus `foo2` in my answer, below). There's also no gain from using sets, because we're having to iterate over the sets, not test them for membership. – jez Dec 07 '16 at 18:16
  • Hi, I realised that the reason I did not see much gain, was probably because my list of sentences was always very short (less than 6). And so the overhead from two enumerations wasn't too bad in my case. Thanks – kPow989 Dec 07 '16 at 22:16

There is a major inefficiency. Namely, you're not using sets. Checking for membership in a set is a very efficient operation (O(1) vs O(n) for a list).

sentences = ["I am in London tonight",
             "I am in San Fran tomorrow",
             "I am in Paris next Wednesday"]

# Imagine the following lists to contain 1000's of strings 
listOfPlaces = {"london", "paris", "san fran"} 
listOfTimePhrases = {"tonight", "tomorrow", "week", "monday", "wednesday", "month"}

start = time.time()

sntceIdxofPlaces = [pos for pos, sent in enumerate(sentences) if any(x in sent for x in listOfPlaces)]
sntceIdxofTimes = [pos for pos, sent in enumerate(sentences) if any(x in sent for x in listOfTimePhrases)]

end = time.time()

print(end-start)
Morgan Thrapp
  • Thanks Morgan. I've just tested using sets, and my code is roughly 10 times faster or so. If I use sent.split() will I not be splitting up longer strings like "San Fran"? I might miss them in the search, no? Thank you again – kPow989 Dec 07 '16 at 17:30
  • Almost anytime you have a static set of unordered data that you're doing lookups on, it's worth using a set. Lookup on a set is an O(1) operation vs O(n) for a list. – Morgan Thrapp Dec 07 '16 at 17:31
  • The `sent.split()` approach catches most cases but unfortunately misses `"san fran"` because that gets split into two words. – jez Dec 07 '16 at 17:32
  • @jez You're right, I completely missed that. I removed that part of my answer. – Morgan Thrapp Dec 07 '16 at 17:33

Here are three alternative approaches that use lookup into each sentence rather than lookup into the lists of target phrases. All of them scale badly with the length of the lists of target phrases.

sentences = [
    "I am the walrus",
    "I am in London",
    "I am in London tonight",
    "I am in San Fran tomorrow",
    "I am in Paris next Wednesday"
]
sentences *= 1000   # really we want to examine large `listOfPlaces` and `listOfTimePhrases`, but these must be unique and so are harder to generate; this is a quicker, dirtier way to test timing scalability

# Imagine the following lists to contain 1000's of strings 
listOfPlaces = {"london", "paris", "san fran"}
listOfTimePhrases = {"tonight", "tomorrow", "week", "monday", "wednesday", "month"}

# preprocess to guard against substring false positives and case-mismatch false negatives:
sentences = [ ' ' + x.lower() + ' ' for x in sentences ]
listOfPlaces = { ' ' + x.lower() + ' ' for x in listOfPlaces }
listOfTimePhrases = { ' ' + x.lower() + ' ' for x in listOfTimePhrases }

#listOfPlaces = list( listOfPlaces )
#listOfTimePhrases = list( listOfTimePhrases )

def foo():
    sntceIdxofPlaces = [pos for pos, sentence in enumerate(sentences) if any(x in sentence for x in listOfPlaces)]
    sntceIdxofTimes  = [pos for pos, sentence in enumerate(sentences) if any(x in sentence for x in listOfTimePhrases)]
    return sntceIdxofPlaces, sntceIdxofTimes

def foo2():
    sntceIdxofPlaces = []
    sntceIdxofTimes = []
    for pos, sentence in enumerate(sentences):
        if any(x in sentence for x in listOfPlaces): sntceIdxofPlaces.append(pos)
        if any(x in sentence for x in listOfTimePhrases): sntceIdxofTimes.append(pos)
    return sntceIdxofPlaces, sntceIdxofTimes

def foo3():
    sntceIdxofPlaces = []
    sntceIdxofTimes = []
    for pos, sentence in enumerate(sentences):
        for x in listOfPlaces:
            if x in sentence: sntceIdxofPlaces.append(pos); break
        for x in listOfTimePhrases:
            if x in sentence: sntceIdxofTimes.append(pos); break
    return sntceIdxofPlaces, sntceIdxofTimes

Here are the timing results:

In [171]: timeit foo()
100 loops, best of 3: 15.6 ms per loop

In [172]: timeit foo2()
100 loops, best of 3: 16 ms per loop

In [173]: timeit foo3()
100 loops, best of 3: 8.07 ms per loop

It seems like `any()` might be inefficient, much to my surprise. It might be running its input generator to the end even when a match has been found early and the answer is already known. I understand that it's not supposed to work like this, but I can't otherwise explain the factor-2 difference in running time between `foo2()` and `foo3()`, which appear to give identical outputs.

Also: since listOfPlaces and listOfTimePhrases are being iterated over, rather than tested for membership, it doesn't seem to make a difference to the timing whether they are sets or lists.

jez
  • 14,867
  • 5
  • 37
  • 64
  • The `any` function doesn't perform like that based on [this](http://stackoverflow.com/a/16505590/3826759) answer... – double_j Dec 07 '16 at 17:53
  • I wouldn't have thought so either, but the timing results are striking. And they're repeatable (I've just run them several times in different orders under anaconda Python 2.7.12 on a 2013 MacBook). The three functions do seem to be returning identical outputs, so if there's a bug in one of them I can't see it. – jez Dec 07 '16 at 18:03
  • If I read correctly something similar was found here https://www.dotnetperls.com/any-python – kPow989 Dec 07 '16 at 22:23

Use sets instead of lists for `listOfPlaces` and `listOfTimePhrases`. Sets have considerably faster lookup time:

listOfPlaces = set(["london", "paris", "san fran"])
listOfTimePhrases = set(["tonight", "tomorrow", "week", "monday", "wednesday", "month"])
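
To actually benefit from O(1) membership tests, you would look each word of the sentence up in the set rather than iterating over the set; a rough sketch (note that this misses multi-word phrases such as "san fran", as discussed in the comments above):

# word-level membership lookup: fast, but misses multi-word phrases like "san fran"
sntceIdxofTimes = [pos for pos, sent in enumerate(sentences)
                   if listOfTimePhrases & set(sent.lower().split())]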
DYZ

Here is a solution that takes advantage of the fast lookup into listOfPlaces and listOfTimePhrases as sets, while at the same time taking account of possible multi-word phrases. It runs fast even when listOfPlaces and listOfTimePhrases contain 1000s of elements.

sentences = [
    "I am the walrus",
    "I am in London",
    "I am in London tonight",
    "I am in San Fran tomorrow",
    "I am in Paris next Wednesday"
]
sentences *= 1000

# Imagine the following lists to contain 1000's of strings 
placePhrases = {"london", "paris", "san fran"}
timePhrases = {"tonight", "tomorrow", "week", "monday", "wednesday", "month"}


# preprocess to guard against case-mismatch false negatives:
sentences = [ x.lower() for x in sentences ]
placePhrases = { x.lower() for x in placePhrases }
timePhrases = { x.lower() for x in timePhrases }

# create additional sets in which incomplete phrases can be looked up:
placePrefixes = set()
for phrase in placePhrases:
    words = phrase.split()
    if len(words) > 1:
        for i in range(1, len(words)):
            placePrefixes.add(' '.join(words[:i]))
timePrefixes = set()
for phrase in timePhrases:
    words = phrase.split()
    if len(words) > 1:
        for i in range(1, len(words)):
            timePrefixes.add(' '.join(words[:i]))

def scan(sentences):
    sntceIdxofPlaces = []
    sntceIdxofTimes = []
    for pos, sentence in enumerate(sentences):
        hasPlace = hasTime = False
        placePrefix = timePrefix = ''
        for word in sentence.split():
            if not hasPlace:
                placePhrase = placePrefix + (' ' if placePrefix else '') + word
                if placePhrase in placePrefixes:
                    placePrefix = placePhrase
                else:
                    placePrefix = ''
                    if placePhrase in placePhrases: hasPlace = True
            if not hasTime:
                timePhrase = timePrefix + (' ' if timePrefix else '') + word
                if timePhrase in timePrefixes:
                    timePrefix = timePhrase
                else:
                    timePrefix = ''
                    if timePhrase in timePhrases: hasTime = True
            if hasTime and hasPlace: break
        if hasPlace: sntceIdxofPlaces.append(pos)
        if hasTime: sntceIdxofTimes.append(pos)
    return sntceIdxofPlaces, sntceIdxofTimes
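
A minimal usage sketch (assuming the preprocessed sentences, placePhrases and timePhrases defined above):

# run the scan and report how many sentences matched each category
sntceIdxofPlaces, sntceIdxofTimes = scan(sentences)
print(len(sntceIdxofPlaces), "place matches,", len(sntceIdxofTimes), "time matches")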
jez

I managed to find a much faster way to do this using scikit-learn's CountVectorizer:

sentences = ["I am in London tonight",
               "I am in San Fran tomorrow",
               "I am in Paris next Wednesday"]

listOfPlaces = ["london", "paris", "san fran"]

cv = feature_extraction.text.CountVectorizer(vocabulary=listOfPlaces)

# We now get in the next step a vector of size len(sentences) x len(listOfPlaces)           
taggedSentences = cv.fit_transform(sentences).toarray()

aggregateTags = np.sum(taggedSentences, axis=1)

We ultimately get a vector of length len(sentences), in which each entry is a tally of the number of times phrases from the word list appeared in the corresponding sentence.
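
To recover the sentence indices themselves (what the original list comprehensions produced), one possible follow-up is to take the nonzero entries:

# indices of sentences that contain at least one place from the word list
sntceIdxofPlaces = np.nonzero(aggregateTags)[0].tolist()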

I found the results with large data sets to be exceptionally fast (roughly 0.02 s).

kPow989