Check if list A contains a prefix of an item in list B

Question

I have two list, which we can call A and B. I need to check the items in list A and see if items in B starts with an item from A and then stop the check.

Example of content in A:

https://some/path
http://another/path
http://another.some/path

Example of content in B:

http://another/path
http://this/wont/match/anything

Currently I'm doing this:

def check_comps(self, comps):
   for a in self.A:
      for b in comps:
         if b.startswith(a):
            return a

Is there a better way to do this?

@MosesKoledoye that doesn't work. I'm trying to see if contents in B startswith anything from A. I'm not looking looking for exact matches — nullByteMe, Jul 21 '16 at 14:53
What you've got will work, but..... it would also identify matches where the string `b` was in the string 'a' but not necessarily at the beginning. You might want to import `re' and use `if re.match(b, a):` as the test condition to trigger the return of a. `re.match` only matches things at the bigining of the string. — R.Sharp, Jul 21 '16 at 14:55
@R.Sharp I know it will work, I'm trying to find out if there is a better more effecient way of doing this — nullByteMe, Jul 21 '16 at 14:56
@free_mind Check if this does what you want: `{i for i in A for j in comps if j.startswith(i)}` — Moses Koledoye, Jul 21 '16 at 15:17
The example provided by OP was not the best since the item returned was actually identical. Something like directories for `B` and subdirectories for `A` would be much better. — Ma0, Jul 21 '16 at 15:20
@Ev.Kounis for some reason it's unable to determine `comps` from this scope — nullByteMe, Jul 21 '16 at 15:27
Extending Moses Koledoye's solution, if you want it to only return the first time a match is found, you might want something like: [i for i in A for j in comps if j.startswith(i)] [0] — Yupsiree, Jul 21 '16 at 15:29
Although you'd probably want a check to make sure it is not an empty list returning False. — Yupsiree, Jul 21 '16 at 15:31
@Ev.Kounis thanks, and I understand that. What I'm saying is python doesn't recognize that variable in that scope. It recognizes it fine without doing it his way. This is a class object. — nullByteMe, Jul 21 '16 at 15:36
@free_mind You should put this in your class method since `comps` comes as a parameter — Moses Koledoye, Jul 21 '16 at 15:48
@MosesKoledoye I guess that's quite a bit less efficient than the OP's solution. The OP's solution is O(1) in the best case and O(n^2) in the worst, while yours is stable O(n^2). — Eli Korvigo, Jul 24 '16 at 21:03
@EliKorvigo How to post an efficient solution in a *comment*? :P Nice solution BTW — Moses Koledoye, Jul 24 '16 at 21:24

Eli Korvigo · Accepted Answer · 2016-07-25T08:44:27.620

Your solution has the worst-case O(nm) time complexity, that is O(n^2) if n ~ m. You can easily reduce it to O(n log(n)) and even O(log(n)). Here is how.

Consider a list of words (your comps attrubute) and a target (your b)

words = ['abdc', 'abd', 'acb', 'abcabc', 'abc']
target = "abcd"

Observe, that by sorting the list of words in lexicographical order, you get a list of prefixes

prefixes = ['abc', 'abcabc', 'abd', 'abdc', 'acb']

It is degenerate, because prefixes[0] is a prefix of prefixes[1], hence everything that starts with prefixes[1] starts with prefixes[0] just as well. This is a bit problematic. Let's see why. Let's use the fast (binary) search to find the proper place of the target in the prefix list.

import bisect


bisect.bisect(prefixes, target)  #  -> 2

This is because the target and prefixes[1] share a prefix, but target[3] > prefixes[1][3], hence lexicographically it should go after. Hence, if there is a prefix of the target in the prefixes, it should be to the left of index 2. Obviously, the target doesn't start with prefixes[1] hence in the worst case we would have to search all the way to the left to find whether there is a prefix. Now observe, that if we transform these prefixes into a nondegenerate list, the only possible prefix of a target will always be to the left of the position returned by bisect.bisect. Let's reduce the list of prefixes and write a helper function that will check whether there is a prefix of a target.

from functools import reduce


def minimize_prefixes(prefixes):
    """
    Note! `prefixes` must be sorted lexicographically !
    """
    def accum_prefs(prefixes, prefix):
        if not prefix.startswith(prefixes[-1]):
            return prefixes.append(prefix) or prefixes
        return prefixes
    prefs_iter = iter(prefixes)
    return reduce(accum_prefs, prefs_iter, [next(prefs_iter)]) if prefixes else []


def hasprefix(minimized_prefixes, target):
    position = bisect.bisect(minimized_prefixes, target)
    return target.startswith(minimized_prefixes[position-1]) if position else False

Now let's see

min_prefixes = minimize_prefixes(prefixes)
print(min_prefixes)  # -> ['abc', 'abd', 'acb']
hasprefix(min_prefixes, target)  # -> True

Let's make a test that must fail:

min_prefs_fail = ["abcde"]
hasprefix(min_prefs_fail, target)  # -> False

This way you get O(n log(n)) search, which is asymptotically faster than your O(n^2) solution. Note! You can (and you really should) store the minimize_prefixes(sorted(comps)) prefix set as an attribute in your object, making any prefix search O(log (n)), which is even more faster than what you have now.

Check if list A contains a prefix of an item in list B

1 Answers1