Language Independant: Check if a string consists of a multiple of a certain substring

Question

I want the general algorithm to find if a string contains a repeating pattern, and no part of the string is left out of the repeating pattern.

For example, look at these sample strings:

abcabcabc - true
abcabcabcx - false
cucumbercucumber - true
cucumber - false
abaaabaaabaa - true

I looked at this answer, which solved the problem for a few cases, but would fail in case of the cucumber example. I need something that would work in all cases.

Interesting question indeed. What I have learned so far is that for every `regex` there is a `finite automata`. The first question should be `Is there a finite autonata for dynamic pattern recognition?`. If so, than you could convert it to regex and vice versa. — TheRealVira, Oct 05 '16 at 05:43
If backreferences are supported, it can be done by regex: `^(.+)\1+$` — Sebastian Proske, Oct 05 '16 at 06:26
Explain why this is example true: abaaabaaabaa What do you think the repeating pattern is? — Ira Baxter, Oct 05 '16 at 08:19
@SebastianProske: Ah. Yes. Well. This is why you want an algorithm instead of a human being to do this. Or at least why you want an algorithm rather than me doing it :-} — Ira Baxter, Oct 05 '16 at 08:37

score 5 · Accepted Answer · edited May 23 '17 at 12:06

5

A Python solution inspired by https://stackoverflow.com/a/2553533/1763356 is

s in (s + s)[1:-1]

This takes O(n) time assuming an efficient implementation of str.__contains__.

edited May 23 '17 at 12:06

Community

1
1

answered Oct 07 '16 at 04:42

Veedrac

58,273
15
112
169

Maybe I don't understand what this does. What does this do with s == "a"? – Ira Baxter Oct 07 '16 at 05:18
1

`s + s == "aa"`, so `(s + s)[1:-1] == ""` (the empty string). Thus `s in (s + s)[1:-1]` returns `False`. – Veedrac Oct 07 '16 at 05:51
Ah, it is zero origin indexing. (I'm not a Python expert). Interesting. So this idea should be implementable in any language with string concat, substring and find. – Ira Baxter Oct 07 '16 at 05:56

Ira Baxter · Answer 2 · 2016-10-05T08:39:25.020

This seems like the obvious way to do it:

String s = "abaaabaabaa" ; // string to test

for (int repeating_pattern_length=1; 
     repeating_pattern_length<=s.length/2;
     repeating_pattern_length++)
{  if (modulo(s.length,repeating_pattern_length)==0)
   { // can fit exactly N times
     String proposed_subpattern=s.substring(0,repeating_pattern_length);
     for (nth_instance=2; // don't need to check 1st occurrence
          nth_instance<=s.length/repeating_pattern_length;
          nth_instance++)
     { // check nth occurrence
       if (!proposed_subpattern.equal(
           s.substring((nth_instance-1)*repeating_pattern_length,
                       repeating_pattern_length)
          cycle repeating_pattern_length; // nth occurrence doesn't match
     }
     return true;
   }
}
return false;

[Untested. This is intended to be Java, but I'm not an expert Java coder. Forgive my trespasses].

This arguably has complexity O(s.length) with a small constant factor.

One might consider building a suffix tree (also linear time) and then checking that the tree has the appropriate cycles. I suspect the above algorithm is pretty good in practice.

score 1 · Answer 3 · answered Oct 06 '16 at 08:17

Since you aren't asking for a specific language, I'd recommend looking into the Rosetta Code page for Repeating String. You can find and study a bunch of algorithms solving the problem. Though the problem is stated for 1s and 0s in Rosetta Code, most solutions should work with any possible strings.

I've written a general Common Lisp recursive solution, here the commented code:

(ql:quickload :alexandria)
(defun rep-stringv (a-str &optional (max-rotation (floor (/ (length a-str) 2))))
  ;; Exit condition if no repetition found.
  (cond ((< max-rotation 1) "Not a repeating string")
        ;; Two checks:
        ;; 1. Truncated string must be equal to rotation by repetion size.
        ;; 2. Remaining chars (rest-str) are identical to starting chars (beg-str)
        ((let* ((trunc (* max-rotation (truncate (length a-str) max-rotation)))
                (truncated-str (subseq a-str 0 trunc))
                (rest-str (subseq a-str trunc))
                (beg-str (subseq a-str 0 (rem (length a-str) max-rotation))))
           (and (string= beg-str rest-str)
                (string= (alexandria:rotate (copy-seq truncated-str) max-rotation)
                         truncated-str)))
         ;; If both checks pass, return the repeting string.
         (subseq a-str 0 max-rotation))
        ;; Recurse function reducing length of rotation.
        (t (rep-stringv a-str (1- max-rotation)))))

Testing:

CL-USER> (rep-stringv "cucumber")
"Not a repeating string"
CL-USER> (rep-stringv "abaaabaaabaa")
"abaa"

The best possible solution can be achieved with a suffix tree for the string, as you probably already now - since it's a common problem described everywhere, e.g., Wikipedia.

Implementing it seems overkill to me, unless you really need performance. In any case, examples of suffix trees (in many languages) can be found here.

Frank Puffer · Answer 4 · 2016-10-08T15:40:46.603

Here's some elementary C++ code that does the job:

bool IsRepeating( std::string in ) {

    int totalLength = in.length();
    for (int subLength = 1; subLength <= totalLength / 2; subLength++ ) {
        if (totalLength % subLength != 0) continue;

        for (int startPos = 0; startPos < subLength; startPos++) {
            char startChar =in[startPos];
            bool mismatchFound = false;
            for (int delta = subLength; delta < totalLength-startPos; delta += subLength) {
                if (in[startPos+delta] != startChar ) {
                    mismatchFound = true;
                    break;
                }
            }
            if (mismatchFound) {
                break;
            }
            return true;
        }
    }
    return false;
}

It makes use of the fact that the substring length has to be a divisor of the total string length.

The worst-case time complexity is pretty bad, something like O(n^2 log(log(n))), but I'm not sure. (Worst case is when the string consists of exactly two identical substrings.) Still I believe that on average it should perform quite well because most of the outer loop body is only executed for divisors of the string length and the inner loops are aborted as soon as a mismatch is found.

Edit: The solution by @Veedrac is not only much more elegant but also more performant in most cases. For a direct comparison, here's the C++ version:

bool IsRepeating( const std::string& in ) {
    if (in.length() < 1) return false;
    return (in + in).substr(1, 2 * in.length() - 2).find(in) != std::string::npos;
}

It does however use more memory. And if you don't know the purpose of the function, it might be tough to figure it out. But that also holds for my original version.

Language Independant: Check if a string consists of a multiple of a certain substring

4 Answers4