1

I'm looking for the best most efficient way to match the end of a single string with a value from a predefined list of strings.
Something like

my_str='QWERTY'
my_lst=['QWE','QQQQ','TYE','YTR','TY']  

match='TY' or match=['TY']

Under the restrictions

len(my_lst) is known but arbitrary thus could be very long, probably around 30
elements in my_lst may have different len so I can't just check a defined last portion of my_str every time
for my_str as well as the matching elements in my_lst they can be either strings or lists, whichever is more efficient (see background)
len(my_str) is mostly small, no longer than 8 characters
in function wouldn't do as I need the matching to occur exclusively at the end.
endswith is no use on it's own since it would only return a Boolean
the match should always be unique or [] as no elements in my_lst would share ending with one another

little background may skip
I started with this problem as a list problem such as ['Q','W','E','R','T','Y'] where I would have a list of lists of 1 character strings for the matching and I was thinking of running a reverse iteration as [::-1] for the checking for every candidate.
Then I realized it was possible to concatenate the inner lists since they contained only strings and run the same logic on the resulting strings.
Finally I came across the endswith string method reading this question but it wasn't quite what I needed. Furthermore my problem can't be generalized to be solved with os module or similar since it's a string problem, not a pathing one.
end of background
I made my approach in this two ways

match=filter(lambda x: my_str.endswith(x), my_lst)
match=[x for x in my_lst if my_str.endswith(x)]

I succeeded but I would like to know if there is some built-in or best way to find and return the matched ending value.

Thanks.

aydow
  • 3,673
  • 2
  • 23
  • 40
Penicilina
  • 13
  • 3
  • Are you actually trying to find all of the matching suffixes, even though there's only ever going to be zero or one, or just find out whether there _is_ a match? – abarnert Jul 12 '18 at 00:45
  • 1
    If the latter, just store your suffixes in a tuple instead of a list and you can do `my_str.endswith(my_tup)`. (Or at least short-circuit the search with `any(my_str.endswith(x) for x in my_lst)`.) – abarnert Jul 12 '18 at 00:46
  • I think the most efficent way (assuming you have to check against many string of input and many potential suffixes) would be to build a trie, or prefix tree, from the suffixes in reverse, then check the end of the string in reverse to see if you can reach one of the leaves of that trie. – Patrick Haugh Jul 12 '18 at 00:52
  • Can you make it so the strings are all reversed and `my_lst` is sorted outside this function? – Mad Physicist Jul 12 '18 at 00:57
  • It's no use to pass in the casted tuple into the endswith because I need to retrieve the match when it exists. As for the short-circuit I might actually try that. About the tree I just couldn't make it look good in my head. And yes I can manipulate quite freely all the strings before processing them, reversing or sorting them inside the list is ok. Also cast the list back and forth from list to tuples and sets if needed. – Penicilina Jul 12 '18 at 01:18
  • @Penicilina. I meant whether you can get the list sorted from whatever the original source is. Sorting will make your algorithmic complexity nlogn instead of n like it should be. Having everything reversed and presorted will make your complexity logn. – Mad Physicist Jul 12 '18 at 01:53
  • Oh. Yes I can. Can also get it casted before hand as I mentioned. – Penicilina Jul 12 '18 at 02:00

1 Answers1

1

Here's a way using a trie, or prefix tree (technically a suffix tree in this situation). If we had three potential suffixes CA, CB, and BA, our suffix tree would look like

     e
    / \
  A     B
 / \    |
B   C   C

(e is the empty string) We start at the end of the input string and consume characters. If we run across the beginning of the string or a character that is not a child of the current node, then we reject the string. If we reach a leaf of the tree, then we accept the string. This lets us scale better to very many potential suffixes.

def build_trie(suffixes):
    head = {}
    for suffix in suffixes:
        curr = head
        for c in reversed(suffix):
            if c not in curr:
                curr[c] = {}
            curr = curr[c]
    return head

def is_suffix(trie, s):
    if not trie:
        return True
    for c in reversed(s):
        try:
            trie = trie[c]
        except KeyError:
            return False
        if not trie:
            return True
    return False

trie = build_trie(['QWE','QQQQ','TYE','YTR','TY'])

gives us a trie of

{'E': {'W': {'Q': {}}, 
       'Y': {'T': {}}},
 'Q': {'Q': {'Q': {'Q': {}}}},
 'R': {'T': {'Y': {}}},
 'Y': {'T': {}}}

If you want to return the matching suffix, that's just a matter of tracking the characters we see as we descend the trie.

def has_suffix(trie, s):
    if not trie:
        return ''
    letters = []
    for c in reversed(s):
        try:
            trie = trie[c]
            letters.append(c)
        except KeyError:
            return None
        if not trie:
            return ''.join(letters)
    return None

It's worth noting that the empty trie can be reached by both build_trie(['']) and build_trie([]), and matches the empty string at the end of all strings. To avoid this, you could check the length of suffixes and return some non-dict value, which you would check against in has_suffix

Patrick Haugh
  • 59,226
  • 13
  • 88
  • 96
  • 1
    it's worth mentioning that this is only useful if you have multiple values of `my_str`. Otherwise a decent implementation of `endswith` combined with `filter` and `next` will of course be much faster. – Mad Physicist Jul 12 '18 at 02:57
  • Yes. The (potentially high) cost of building the trie is amortized across all the subsequent searches. Potentially, if your program will only ever search for the same large list of suffixes, you can generate a JSON of the trie and deploy it with your program, meaning there's no start up cost for the end user beyond reading a file. – Patrick Haugh Jul 13 '18 at 14:31