How could I make this sub-string finder more efficient and cleaner?

Question

This was the first solution that came into my mind, but I can't really imagine another one.

strings = [input("String: ") for i in range(int(input("How many strings?")))]
num = 0

for i, subA in enumerate(strings):
    for j,subB in enumerate(strings):
        if (i != j) and (subB in subA):
            num += 1
print("There are", num, "substrings")

If you want to see performance boost, here is your option: `num = sum(starmap(contains, combinations(strings, 2)))`. Imports: [`contains()`](https://docs.python.org/3/library/operator.html#operator.contains), [`combinations()`](https://docs.python.org/3/library/itertools.html#itertools.combinations), [`starmap()`](https://docs.python.org/3/library/itertools.html#itertools.starmap). You can use it as replacement of your nested loop. — Olvin Roght, Sep 25 '21 at 13:31
I'm sorry, there's a mistake in example above, you should use [`permutations()`](https://docs.python.org/3/library/itertools.html#itertools.permutations) instead of [`combinations()`](https://docs.python.org/3/library/itertools.html#itertools.combinations). `num = sum(starmap(contains, permutations(strings, 2)))` — Olvin Roght, Sep 25 '21 at 13:52
@OlvinRoght , thanks a lot, now I just need to understand how it actually works — , Sep 25 '21 at 14:07

Olvin Roght · Accepted Answer · 2021-09-25T15:05:34.557

In your implementation you're reinventing itertools.permutations() which returns permutations of elements with given length. You can compare lists generated with nested for loop (from your code sample) and permutations():

from itertools import permutations

strings = ["a", "aa", "b"]
res = []
for i, subA in enumerate(strings):
    for j, subB in enumerate(strings):
        if i != j:
            res.append((subA, subB))
print("Nested loop:", res)
res = list(permutations(strings, 2))
print("permutations():", res)

You need to check whether each of element is a substring of another element, so you can iterate over pairs returned from permutations() and test does first element contain second (or vise versa). Let's do it with simple list comprehension:

from itertools import permutations

strings = ["a", "aa", "b"]
res = [a in b for a, b in permutations(strings, 2)]
# will return [True, False, False, False, False, False]

In python True is 1 and False is 0 (docs). So to count how many strings are substrings we can pass a generator expression into a sum().

from itertools import permutations

strings = ["a", "aa", "b"]
num = sum(a in b for a, b in permutations(strings, 2))
# will return 1

You can also use itertools.starmap() to call operator.contains() (same as a in b) with every pair returned by permutations().

from operator import contains
from itertools import permutations, starmap

strings = ["a", "aa", "b"]
num = sum(starmap(contains, permutations(strings, 2)))

Here is a bit improved version of your code:

from operator import contains
from itertools import permutations, starmap

count = input("How many strings? ")
if count.isdecimal() and (count := int(count)):
    strings = []
    while count:
        item = input(f"String ({count} left): ")
        if item:  # skip empty strings
            strings.append(item)
            count -= 1
    num = sum(starmap(contains, permutations(strings, 2)))
    print("There", "are" if num > 1 else "is", num or "no",
          "substring" + "s" * (num != 1))
else:
    print(f'"{count}" is not a valid positive number')

P.S. Some notes on performance.

Because of the method how sum() process iterable you can make some patches on code with generator expression to work faster.

sum([1 for a, b in permutations(strings, 2) if a in b])

will be slightly faster than

sum(a in b for a, b in permutations(strings, 2))

Why? Take a look on next questions:

score 0 · Answer 2 · answered Sep 25 '21 at 13:25

0

strings = [input("String: ") for i in range(int(input("How many strings?")))]
num = 0

for i, subA in enumerate(strings[:-1]):
    for subB in strings[i+1:]:
        if subB in subA or subA in subB:
            num += 1
print("There are", num, "substrings")

Now, I'm not exactly sure that I understand what you're trying to do, but this ensures that each string is compared to all others only once. It's untested so don't take my word for it though.

I'm assuming that, given an array of strings, you want to find how many of these strings are contain the any other string as substring. Note that if this is the case there is a the question of what to do with identical strings in the list.

answered Sep 25 '21 at 13:25

Johan Kuylenstierna

189
5

The idea is to check if the string "b" is contained in the string "a", and it can't be the same string, so the code has to check every string in the array for every string, that's why I'm trying to make it faster – Sep 25 '21 at 13:38
the identical strings will count as a sub-string, and using the i and j indexes we'd prevent the same item for counting itself – Sep 25 '21 at 13:51
Yes, I get around that by only comparing each string to the subsequent strings in the list, as well as comparing those strings with the string in question. The complexity of your code is O(n^2) and mine should be O(nlogn) – Johan Kuylenstierna Sep 25 '21 at 14:02
The output of our algorithms will differ only if the list of strings contains identical strings. Say if you have the list ["a", "b", "c", "abaa", "aca", "u", "caa", "abracadabra", "a", "b"], in your code the first "a" will be compared to the last "a" AND vice versa, thus triggering two additions to the count, in my algorithm it will only be added once as both are compared simultaneously. If you prefer it your way you can changing it to num += 1 + (subB in subA and subA in subB). – Johan Kuylenstierna Sep 25 '21 at 14:23
So that's were I was getting different answers, and that extra addition is actually part of the answer, since the last "a" is part/equal the first "a". – Sep 25 '21 at 14:36

How could I make this sub-string finder more efficient and cleaner?

2 Answers2