1

I have a set of queries, where some are merely portions of the eventual search string. I need to clean the partial strings from a very long collection of queries. Is a fast way to do this across potentially millions of sets like this?

t = {u'house prices',
 u'how ',
 u'how man',
 u'how many animals go ex',
 u'how many animals go extinted eac',
 u'how many animals go extinted each ',
 u'how many species go',
 u'how many species go extin',
 u'how many species go extinet each yea',
 u'how many species go extinet each year?'}

I would like to retain only:

t = {u'house prices',
 u'how many species go extinet each year?',
 u'how many animals go extinted each '}

Here's the solution from @Alex Hall, edited to catch the final string (the concatenation of '-+-' does this)

# Print out the unique strings
q = sorted(list(t)) + ['-+-']
for i in range(len(q) - 1):
    if not q[i+1].startswith(q[i]):
        print i, q[i]
zbinsd
  • 4,084
  • 6
  • 33
  • 40
  • Sets only work with identities based on hash values, but two very similar strings have a very different hash value (by design), so having a set will not give you any benefit here. You still have to loop through everything, and probably set up your own index instead. – poke Oct 09 '15 at 17:20
  • 1
    what happened to `"how many animals go..."`? – Filip Roséen - refp Oct 09 '15 at 17:20

2 Answers2

4

Sort the set to make a list q, then iterate through it and build up a new list of elements where not q[i+1].startswith(q[i]). Should do the trick reasonably well.

Alex Hall
  • 34,833
  • 5
  • 57
  • 89
  • This is a beautifully simple solution, but for long overlapping strings, the `q[i] not in q[i+1]` will take a lot of time. – Aasmund Eldhuset Oct 09 '15 at 17:30
  • In CPython it should be extremely quick. I think long strings would be a much bigger problem for inserting into a trie as far as constant factors go, especially if the implementation of the trie is written in python. Besides the nature of the strings suggests they won't get too long. – Alex Hall Oct 09 '15 at 17:47
  • Err - I have apparently not woken up today, and I realize now that your solution is superior (somehow, I thought for a while that you'd compare _all_ pairs of strings in this manner). +1. – Aasmund Eldhuset Oct 09 '15 at 18:10
  • Thanks @AlexHall, I edited my question to show your elegant answer, adding a nonsense element to the list for the end case. – zbinsd Oct 09 '15 at 19:15
3

Edit: Alex Hall's solution is better.

For each set, create a new trie and insert all the set's strings into it. In the resulting trie, the leaf nodes represent the strings that are not prefixes of any other strings. With a good trie implementation, the runtime is expected to be linear in the sum of the length of the strings.

Aasmund Eldhuset
  • 37,289
  • 4
  • 68
  • 81