I have a set of queries, some of which are only partial prefixes of the eventual search string. I need to remove these partial strings from a very long collection of queries. Is there a fast way to do this across potentially millions of sets like this?
t = {u'house prices',
u'how ',
u'how man',
u'how many animals go ex',
u'how many animals go extinted eac',
u'how many animals go extinted each ',
u'how many species go',
u'how many species go extin',
u'how many species go extinet each yea',
u'how many species go extinet each year?'}
I would like to retain only:
t = {u'house prices',
u'how many species go extinet each year?',
u'how many animals go extinted each '}
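Here's the solution from @Alex Hall, edited to catch the final string. After sorting, each partial string sits directly before the string that extends it, so it is enough to compare each entry with the next one. Appending the sentinel '-+-' gives the last entry something to be compared against, so it is printed too.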
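# Print each string that is not a prefix of the next string in sorted order
q = sorted(t) + ['-+-']  # sentinel so the last real string is also checked
for i in range(len(q) - 1):
    if not q[i+1].startswith(q[i]):
        print("%d %s" % (i, q[i]))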
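If you want the result back as a set rather than printed, the same idea can be wrapped in a function. This is only a sketch of the sort-and-compare approach above: the name `drop_prefixes` is my own, and it keeps the last sorted string explicitly instead of relying on a sentinel, so it does not depend on no query being a prefix of '-+-'.

```python
def drop_prefixes(queries):
    """Return the queries that are not a prefix of another query in the set.

    Hypothetical helper name; assumes `queries` is an iterable of strings.
    """
    ordered = sorted(queries)
    kept = set()
    # In sorted order, every string that extends `current` follows it in one
    # contiguous run, so checking the immediate neighbour is sufficient.
    for current, following in zip(ordered, ordered[1:]):
        if not following.startswith(current):
            kept.add(current)
    if ordered:
        # Nothing sorts after the last string, so it cannot be a prefix.
        kept.add(ordered[-1])
    return kept


t = {u'house prices',
     u'how ',
     u'how man',
     u'how many animals go ex',
     u'how many animals go extinted eac',
     u'how many animals go extinted each ',
     u'how many species go',
     u'how many species go extin',
     u'how many species go extinet each yea',
     u'how many species go extinet each year?'}

result = drop_prefixes(t)
```

For many sets you would call this once per set. The cost per set is dominated by the sort, roughly O(n log n) comparisons, followed by a single linear pass.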