String subpattern recognition optimization

Question

In this kata you need to build a function to return either true/True or false/False if a string can be seen as the repetition of a simpler/shorter subpattern or not.

For example:

has_subpattern("a") == False #no repeated pattern
has_subpattern("aaaa") == True #created repeating "a"
has_subpattern("abcd") == False #no repeated pattern
has_subpattern("abababab") == True #created repeating "ab"
has_subpattern("ababababa") == False #cannot be entirely reproduced repeating a pattern

Strings will never be empty and can be composed of any character (just consider upper- and lowercase letters as different entities) and can be pretty long (keep an eye on performances!).

My solution is:

def has_subpattern(string):
    string_size = len(string)

    for i in range(1, string_size):
        slice1 = string[:i]
        appearence_count = string.count(slice1)
        slice1_len = len(slice1)
        if appearence_count > 0:
            if appearence_count * slice1_len == string_size:
                return True

    return False

Obviously there are weak and too slow things like slice1 = string[:i] and string.count() in loop.. Is there better ways to solve an issue or ways to improve performance ?

@norok2 that `.count()` is loop over `string` variable each time in `for` loop, if string is 30k chars it's tooo expensive to do — Vlad Davidchenko, Oct 23 '19 at 07:45

score 2 · Answer 1 · answered Oct 22 '19 at 16:10

2

Short regex approach:

import re

def has_subpattern_re(s):
    return bool(re.search(r'^(\w+)\1+$', s))

It'll provide a close (to initial has_subpattern approach) performance on small strings:

import timeit

...
print(timeit.timeit('has_subpattern("abababab")', 'from __main__ import has_subpattern'))
0.7413144190068124
print(timeit.timeit('has_subpattern_re("abababab")', 'from __main__ import re, has_subpattern_re'))
0.856149295999785

But, a significant performance increase (in about 3-5 times faster) on long strings:

print(timeit.timeit('has_subpattern("ababababababababababababababababababababababababa")', 'from __main__ import has_subpattern'))
14.669428467008402
print(timeit.timeit('has_subpattern_re("ababababababababababababababababababababababababa")', 'from __main__ import re, has_subpattern_re'))
4.308312018998549

And one more test for a more longer string:

print(timeit.timeit('has_subpattern("ababababababababababababababababababababababababaababababababababababababababababababababababababab")', 'from __main__ import has_subpattern'))
35.998031173992786
print(timeit.timeit('has_subpattern_re("ababababababababababababababababababababababababaababababababababababababababababababababababababab")', 'from __main__ import re, has_subpattern_re'))
7.010367843002314

answered Oct 22 '19 at 16:10

RomanPerekhrest

88,541
4
65
105

1

With some simple modifications (essentially skipping impossible values of `i`) you should be able to get approx. on par with the `re`-based solution even with explicit looping for this input size. – norok2 Oct 22 '19 at 18:28
1

For *MUCH* larger inputs, this gets to be much slower than the optimized loop approach. – norok2 Oct 23 '19 at 09:38
@norok2, You are talking too much about your "optimized loop approach". I made comparisons to the initial OP's approach. And I did not say that there could not be another more faster approach than mine. – RomanPerekhrest Oct 23 '19 at 09:56
Nor did I implied that you said anything beyond what is in the answer. I thought it is worth noting that this is approach may or may not be faster than OP approach, and simple modifications thereof will be more performing than this. Nevertheless, this answer is a very helpful contribution to the problem. – norok2 Oct 23 '19 at 10:05
@norok2, I think we need to accept the reality that there could be *alternatives*. "Let it be" – RomanPerekhrest Oct 23 '19 at 10:07

norok2 · Accepted Answer · 2019-10-23T12:42:51.517

Within standard Python, the bottlenecks here will be count, which enjoys C speed implementation and the looping. The looping itself may be hard to speed-up (althogh Cython may be of some help).

Hence, the most important optimization is to reduce the number of loopings.

One obvious way is to let range() do not exceed half the size of the input (+ 2: + 1 for rounding issues, + 1 for end extrema exclusion in range()):

Also, string is a standard Python module, so better not use it as a variable name.

def has_subpattern_loop(text):
    for i in range(1, len(text) // 2 + 2):
        subtext = text[:i]
        num_subtext = text.count(subtext)
        if num_subtext > 1 and num_subtext * len(subtext) == len(text):
            return True
    return False

A much more effective way of restricting the number of calls to count is to skip computation when i is not a multiple of the length of the input.

def has_subpattern_loop2(text):
    for i in range(1, len(text) // 2 + 2):
        if len(text) % i == 0:
            subtext = text[:i]
            num_subtext = text.count(subtext)
            if num_subtext > 1 and num_subtext * len(subtext) == len(text):
                return True
    return False

Even better would be to generate only the divisors of the length of the input. This could be done using sympy and the approach outlined here:

import sympy as sym
import functools


def get_divisors(n):
    if n == 1:
        yield 1
        return
    factors = list(sym.factor_.factorint(n).items())
    nfactors = len(factors)
    f = [0] * nfactors
    while True:
        yield functools.reduce(lambda x, y: x * y, [factors[x][0]**f[x] for x in range(nfactors)], 1)
        i = 0
        while True:
            f[i] += 1
            if f[i] <= factors[i][1]:
                break
            f[i] = 0
            i += 1
            if i >= nfactors:
                return


def has_subpattern_divs(text):
    for i in get_divisors(len(text)):
        subtext = text[:i]
        num_subtext = text.count(subtext)
        if num_subtext > 1 and num_subtext * len(subtext) == len(text):
            return True
    return False

A completely different approach is the one proposed in @ВладДавидченко answer:

def has_subpattern_find(text):
    return (text * 2).find(text, 1) != len(text)

or the more memory efficient (requires ~50% less additional memory compared to has_subpattern_find2()):

def has_subpattern_find2(text):
    return (text + text[:len(text) // 2 + 2]).find(text, 1) > 0

and it is based on the idea that if there is a exactly self-repeating string, the string itself must be found in a circularly extended string:

Input:      abab
Extension1: abababab
Found1:     |-abab
Extension2: ababab
Found2:     |-abab

Input:      ababa
Extension1: ababaababa
Found1:     |----ababa
Extension2: ababab
Found2:     NOT FOUND!

The find-based method are the fastest, with has_subpattern_find() being fastest in the small input size regime, and has_subpattern_find2() gets generally faster in the intermediate and large input size regime (especially in the False case). For shorter inputs, the direct looping approaches (especially has_subpattern_loop2()) are fastest, closely followed (but sometimes surpassed by has_subpattern_re()), but as soon as the input gets bigger (and especially for the False outcome), the has_subpattern_divs() method gets to be the fastest (aside of find-based ones) by far and large, as shown by the following benchmarks. For the True outcome, has_subpattern_loop2() gets to be the fastest due to the very small number of loops required, which is independent of the input size.

The input is generated as a function of n using:

def gen_input(n, m=0):
    g = string.ascii_lowercase
    if not m:
        m = n
    offset = '!' if n % 2 else ''
    return g[:n] * (m // min(n, len(g)) + 2) + offset

so that if n is even, the has_subpattern*() always return True and the opposite for odd n. Note that, in general, the has_subpattern() function will depend not only on the raw size of the input but also on the length of the repeating string, if any. This is not explored in the benchmarks, except for the odd/even separation.

Even Inputs

Odd Inputs

(Full code available here).

(EDITED to include some more solutions as well as comparison with regex-based solution from @RomanPerekhrest)

(EDITED to include some more solutions based on the find from @ВладДавидченко)

score 1 · Answer 3 · answered Oct 23 '19 at 08:42

1

Found another one solution, probably will be useful:

def has_subpattern(string):
    return (string * 2).find(string, 1) != len(string)

answered Oct 23 '19 at 08:42

Vlad Davidchenko

454
1
7
24

1

That will be fast, but not so memory efficient. – norok2 Oct 23 '19 at 08:58
@norok2 for False cases where it's needed to fully go throw string? – Vlad Davidchenko Oct 23 '19 at 09:43
1

No, the memory inefficiency is due to the need of creating a temporary array twice the size of the input when you do `s * 2` and it is independent of the final outcome. Anyway I am updating my answer with to include also this and a slightly more memory efficient version of it in the benchmarks. Also, implementing this in Cython with a modulo-ed index over `s` for the `find` operation will likely be the fastest and you will get back on business with optimal memory efficiency. – norok2 Oct 23 '19 at 09:47

String subpattern recognition optimization

3 Answers3

Linked