Here are three alternative approaches that use lookup into each sentence rather than lookup into the lists of target phrases. All of them scale badly with the length of the lists of target phrases.
sentences = [
"I am the walrus",
"I am in London",
"I am in London tonight",
"I am in San Fran tomorrow",
"I am in Paris next Wednesday"
]
sentences *= 1000 # really we want to examine large `listOfPlaces` and `listOfTimePhrases`, but these must be unique and so are harder to generate---this is an quicker dirtier way to test timing scalability
# Imagine the following lists to contain 1000's of strings
listOfPlaces = {"london", "paris", "san fran"}
listOfTimePhrases = {"tonight", "tomorrow", "week", "monday", "wednesday", "month"}
# preprocess to guard against substring false positives and case-mismatch false negatives:
sentences = [ ' ' + x.lower() + ' ' for x in sentences ]
listOfPlaces = { ' ' + x.lower() + ' ' for x in listOfPlaces }
listOfTimePhrases = { ' ' + x.lower() + ' ' for x in listOfTimePhrases }
#listOfPlaces = list( listOfPlaces )
#listOfTimePhrases = list( listOfTimePhrases )
def foo():
sntceIdxofPlaces = [pos for pos, sentence in enumerate(sentences) if any(x in sentence for x in listOfPlaces)]
sntceIdxofTimes = [pos for pos, sentence in enumerate(sentences) if any(x in sentence for x in listOfTimePhrases)]
return sntceIdxofPlaces, sntceIdxofTimes
def foo2():
sntceIdxofPlaces = []
sntceIdxofTimes = []
for pos, sentence in enumerate(sentences):
if any(x in sentence for x in listOfPlaces): sntceIdxofPlaces.append(pos)
if any(x in sentence for x in listOfTimePhrases): sntceIdxofTimes.append(pos)
return sntceIdxofPlaces, sntceIdxofTimes
def foo3():
sntceIdxofPlaces = []
sntceIdxofTimes = []
for pos, sentence in enumerate(sentences):
for x in listOfPlaces:
if x in sentence: sntceIdxofPlaces.append(pos); break
for x in listOfTimePhrases:
if x in sentence: sntceIdxofTimes.append(pos); break
return sntceIdxofPlaces, sntceIdxofTimes
Here are the timing results:
In [171]: timeit foo()
100 loops, best of 3: 15.6 ms per loop
In [172]: timeit foo2()
100 loops, best of 3: 16 ms per loop
In [173]: timeit foo3()
100 loops, best of 3: 8.07 ms per loop
It seems like any()
might be inefficient, much to my surprise. It might be running its input generator to the end even when a match has been found early and the answer is already known. I understand that it's not supposed to work like this, but I can't otherwise explain the factor-2 difference in running time between foo2()
and foo3()
which appear to give identical outputs.
Also: since listOfPlaces
and listOfTimePhrases
are being iterated over, rather than tested for membership, it doesn't seem to make a difference to the timing whether they are set
s or list
s.