Reduce list based off of element substrings

Question

I'm looking for the most efficient way to reduce a given list based off of substrings already in the list.

For example,

mylist = ['abcd','abcde','abcdef','qrs','qrst','qrstu']

would be reduced to:

mylist = ['abcd','qrs']

because 'abcd' and 'qrs' are the shortest strings that the other elements in the list start with. I was able to do this with about 30 lines of code, but I suspect there is a crafty one-liner out there.

At a high level, it's simple: build a [radix tree](https://en.wikipedia.org/wiki/Radix_tree), then take the direct children of the root that represent actual elements (a node is just a maximal common prefix of its descendants). In practice, you'll need to track down a decent implementation of a radix tree. [This question](https://stackoverflow.com/questions/4707296/are-there-any-radix-patricia-critbit-trees-for-python) might help you start. — chepner, Jun 13 '17 at 16:59
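To make the tree idea in the comment above concrete, here is a minimal sketch. It assumes a plain character trie built from nested dicts rather than a true compressed radix tree, and the names `END`, `build_trie`, `shortest_prefixes`, and `reduce_with_trie` are made up for illustration. The walk stops at the first node on each path that marks the end of a list element, which is the shortest prefix on that path.

```python
END = object()  # hypothetical sentinel key: "a list element ends at this node"

def build_trie(strings):
    """Build a character trie out of nested dicts."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def shortest_prefixes(node, prefix=""):
    """Yield the first complete element found on each path from the root."""
    if END in node:
        # Everything below this node starts with `prefix`, so stop descending.
        yield prefix
        return
    for ch, child in node.items():
        yield from shortest_prefixes(child, prefix + ch)

def reduce_with_trie(strings):
    return list(shortest_prefixes(build_trie(strings)))

print(reduce_with_trie(['abcd', 'abcde', 'abcdef', 'qrs', 'qrst', 'qrstu']))
# ['abcd', 'qrs']
```

Building the trie takes time proportional to the total number of characters. The recursive walk goes one level deep per character, so very long strings could hit Python's recursion limit.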

score 3 · Accepted Answer · answered Jun 13 '17 at 17:22

This seems to work (though I suppose it is not very efficient):

def reduce_prefixes(strings):
    sorted_strings = sorted(strings)
    return [element
            for index, element in enumerate(sorted_strings)
            if all(not previous.startswith(element) and
                   not element.startswith(previous)
                   for previous in sorted_strings[:index])]

Tests:

>>> reduce_prefixes(['abcd', 'abcde', 'abcdef',
...                  'qrs', 'qrst', 'qrstu'])
['abcd', 'qrs']
>>> reduce_prefixes(['abcd', 'abcde', 'abcdef',
...                  'qrs', 'qrst', 'qrstu',
...                  'gabcd', 'gab', 'ab'])
['ab', 'gab', 'qrs']
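A possible refinement, which is not part of the original answer: once the list is sorted, all strings sharing a prefix sit next to each other, directly after that prefix. It should therefore be enough to compare each string with the last one kept, instead of with every earlier string. The function name `reduce_prefixes_sorted` is made up for this sketch.

```python
def reduce_prefixes_sorted(strings):
    """Keep only strings that do not start with another string in the list."""
    kept = []
    for s in sorted(strings):
        # In sorted order, a prefix of s (if one was kept) is the last kept item.
        if not kept or not s.startswith(kept[-1]):
            kept.append(s)
    return kept

print(reduce_prefixes_sorted(['abcd', 'abcde', 'abcdef',
                              'qrs', 'qrst', 'qrstu',
                              'gabcd', 'gab', 'ab']))
# ['ab', 'gab', 'qrs']
```

After the sort, this does one `startswith` check per string, so the sort dominates the running time.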

Pre-sorting the strings is a clever trick, which probably speeds it up considerably compared to my naive solution. — Błotosmętek, Jun 13 '17 at 17:37

Artyer · Answer 2 · answered Jun 13 '17 at 17:30

One solution is to iterate over all the strings, group them by their next character, and recursively apply the same function to each group.

def reduce_substrings(strings):
    return list(_reduce_substrings(map(iter, strings)))

def _reduce_substrings(strings):
    # A dictionary of characters to a list of strings that begin with that character
    nexts = {}
    for string in strings:
        try:
            nexts.setdefault(next(string), []).append(string)
        except StopIteration:
            # Reached the end of this string, so it is the shortest prefix in this group.
            yield ''
            return
    for next_char, next_strings in nexts.items():
        for next_substrings in _reduce_substrings(next_strings):
            yield next_char + next_substrings

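This groups the strings into a dictionary keyed by their next character, then looks for the shortest prefix within each group. As soon as one string in a group is exhausted, it is a prefix of every other string in that group, so it is the only result for the group.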

Of course, because this function is recursive, it cannot be written as a one-liner without losing efficiency.

score 0 · Answer 3 · answered Jun 13 '17 at 17:26

Probably not the most efficient, but at least short:

mylist = ['abcd','abcde','abcdef','qrs','qrst','qrstu']

outlist = []
for l in mylist:
    if any(o.startswith(l) for o in outlist):
        # l is a prefix of some elements in outlist, so it replaces them
        outlist = [ o for o in outlist if not o.startswith(l) ] + [ l ]
    if not any(l.startswith(o) for o in outlist):
        # l has no prefix in outlist yet, so it becomes a prefix candidate
        outlist.append(l)

print(outlist)

score -1 · Answer 4 · answered Jun 13 '17 at 17:12

Try this one:

import re
mylist = ['abcd','abcde','abcdef','qrs','qrst','qrstu']
new_list=[]
for i in mylist:
    if re.match("^abcd$",i):
        new_list.append(i)
    elif re.match("^qrs$",i):
        new_list.append(i)
print(new_list)
#['abcd', 'qrs']

dildeolupbiten

This assumes the values in the list are known. They will be unknown, and no item may have another item in the list as a substring. – Chris Hall Jun 13 '17 at 17:18
I got it. Thank you. – dildeolupbiten Jun 13 '17 at 17:25
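The answers in this thread all test prefixes with `startswith`, while the comment above talks about substrings in general. If a match anywhere in the string should count, one possible sketch (an assumption about the intent, with the made-up name `reduce_by_substring`) sorts by length and uses the `in` operator:

```python
def reduce_by_substring(strings):
    """Keep only strings that contain no other string from the list."""
    kept = []
    # Shortest first, so any substring of s is seen before s itself.
    for s in sorted(strings, key=len):
        if not any(k in s for k in kept):
            kept.append(s)
    return kept

print(reduce_by_substring(['abcd', 'abcde', 'abcdef',
                           'qrs', 'qrst', 'qrstu',
                           'gabcd', 'gab', 'ab']))
# ['ab', 'qrs']
```

Here 'gab' is dropped because it contains 'ab', whereas a prefix-only check would keep it. Each string is compared against every kept string, so this is quadratic in the worst case.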
