23

I'm about to write a function which, would return me a shortest period of group of letters which would eventually create the given word.

For example word abkebabkebabkeb is created by repeated abkeb word. I would like to know, how efficiently analyze input word, to get the shortest period of characters creating input word.

Farsan Rashid
  • 1,460
  • 4
  • 17
  • 28
jack44
  • 293
  • 1
  • 2
  • 5

13 Answers13

22

Here is a correct O(n) algorithm. The first for loop is the table building portion of KMP. There are various proofs that it always runs in linear time.

Since this question has 4 previous answers, none of which are O(n) and correct, I heavily tested this solution for both correctness and runtime.

def pattern(inputv):
    if not inputv:
        return inputv

    nxt = [0]*len(inputv)
    for i in range(1, len(nxt)):
        k = nxt[i - 1]
        while True:
            if inputv[i] == inputv[k]:
                nxt[i] = k + 1
                break
            elif k == 0:
                nxt[i] = 0
                break
            else:
                k = nxt[k - 1]

    smallPieceLen = len(inputv) - nxt[-1]
    if len(inputv) % smallPieceLen != 0:
        return inputv

    return inputv[0:smallPieceLen]
Buge
  • 476
  • 3
  • 14
  • 1
    So is this a solution you have come up with or is this a known algorithm? – davidbuttar Nov 25 '15 at 23:53
  • 4
    Well [KMP is a known algorithm](https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm). This question was very similar to a homework problem I had, and this is the answer I came up with for the homework. The instructor's solution was a bit different, but also used KMP. – Buge Nov 27 '15 at 07:24
  • Hi Buge, love your solution, and vote up. but confused by this line `smallPieceLen = len(inputv) - nxt[-1]`, and `nxt[-1]` means if the whole string does not match, what index we will be used to compare next. `smallPieceLen` represents the differences total length of string and `nxt[-1]`, how it could be represented as shortest repetitive string? – Lin Ma Oct 27 '16 at 22:38
  • 2
    @LinMa: (Buge wasn't active lately) `nxt[-1] means if the whole string does not match, what index we will be used to compare next` no (contorted grammar, btw.). It is the index to compare next when all of the pattern matched and you want to find its next occurrence in a longer text. `nxt[i] = p` means `pattern[i+1-p:i+1]` equals `pattern[0:p]` (& != for `p+1`). `nxt[-1]` is the index to compare next if the "first" mismatch was "at `len`+1". (In many a presentation/implementation of KMP, there is a special value of -1 at index 0, with the n values as above "shifted to an index higher by one".) – greybeard Oct 28 '16 at 09:37
  • Hi @greybeard, spend time today to debug your comments and Buge's code carefully, and you are correct, thanks for clarifying the meaning of `nxt[i]=j`. There is one thing I still mis-understand, `len(inputv) - nxt[-1]`, this line checks the longest prefix, which matches suffix of the same string, `smallPieceLen ` is the remaining part length, why if `len(inputv)` could be divided by `smallPieceLen`, then it is shortest repetitive string? Is there a simple prove or some explanation? Is it too aggressive to make the conclusion? – Lin Ma Oct 30 '16 at 22:44
  • Maybe this is a question to both @Buge and greybeard. :) – Lin Ma Oct 30 '16 at 22:44
  • 1
    @LinMa: (`both` are notified, anyway) Let me call `len(inputv)` _len_ and `nxt[-1]` _matchLen_. If _matchLen_ < _smallPieceLen_, the only chance for _smallPieceLen_ to divide _len_ is to be equal to it. If _smallPieceLen_ ≤ _matchLen_, `inputv[0:smallPieceLen]` equals `inputv[smallPieceLen:2*smallPieceLen]`, and `k` never got reset (again): inputv is made up of repetitions of `inputv[0:smallPieceLen]` - the divisibility check just ensures that it ends with a full repetition. – greybeard Oct 31 '16 at 08:37
  • @greybeard, nice answer, I think about your reply for more than one day, but not sure how do you get this conclusion `inputv[0:smallPieceLen] equals inputv[smallPieceLen:2*smallPieceLen]`, would you elaborate a bit more? – Lin Ma Nov 02 '16 at 06:53
  • The way I think about it is `nxt[-1]` is the max amount that the beginning of `inputv` can overlap with the end of itself (without overlapping the entire thing). For example in `abcabcabc` `nxt[-1]==6` because the last 6 letters are the same as the first 6. I specifically like to think of it as the length of the last characters. So in the example it can be split into 2 parts: `abc|abcabc` where the first part has length `smallPieceLen` and the second part has length `nxt[-1]`. Because of the overlap knowledge, we know `abc` overlaps at the beginning with `abcabc` as far as the overlap or `abc – Buge Dec 05 '16 at 06:39
  • Can someone translate this answer to a known programming language like java or C? It would be much appreciated to someone like me who doesn't understand this one. Thank you. – Axel Sep 23 '17 at 06:38
  • what does: nxt = [0]*len(inputv) do? – Axel Sep 24 '17 at 12:14
  • It's python which I consider a known programming language, the same language as the accepted answer. `[0]*len(inputv)` [means](https://stackoverflow.com/q/3459098/830749) create a python list (array) with the same length as `inputv` and with all elements being `0`. Anyway, [here](https://ideone.com/ViZKCX) it is in Java. C would be a little annoying because of its less flexible strings and arrays. – Buge Sep 24 '17 at 23:35
  • This doesn't work the way I expected if the pattern ends abruptly. For example `[1, 2, 1]` should return `[1, 2]` but this code returns `[1, 2, 1]`. – Boris Verkhovskiy Feb 09 '19 at 23:26
  • 1
    @Boris I consider `[1, 2]` wrong. The question says "For example word `abkebabkebabkeb` is created by repeated `abkeb` word." But it's not possible to create `[1, 2, 1]` by repeating `[1, 2]`. You need to repeat it, then chop off the end to get `[1, 2, 1]`. But yes, the question is somewhat ambiguous about what the correct answer to `[1, 2, 1]` is. – Buge Feb 11 '19 at 01:05
2

This is an example for PHP:

<?php
function getrepeatedstring($string) {
    if (strlen($string)<2) return $string;
    for($i = 1; $i<strlen($string); $i++) {
        if (substr(str_repeat(substr($string, 0, $i),strlen($string)/$i+1), 0, strlen($string))==$string)
            return substr($string, 0, $i);
    }
    return $string;
}
?>
SamB
  • 9,039
  • 5
  • 49
  • 56
AndersTornkvist
  • 2,610
  • 20
  • 38
  • This returns 'abkeb' which should be correct but I'm not sure in what way the OP is asking for 'kebab' rather than 'abkeb'. – pimvdb May 16 '11 at 18:12
  • This is what I'm looking for. But it runs in O(n). Any ideas if this can be speeded up? – jack44 May 16 '11 at 18:18
  • 2
    @jack44: You can't know if you have the shortest cycle until you've examined the entire string. Unless you have other knowledge, like what the largest possible cycle might be. It might be that the last character in the string throws the whole cycle off, you don't know. – mellamokb May 16 '11 at 18:27
  • 4
    I don't know PHP, but this looks like it's O(n^2). – dfb May 16 '11 at 18:29
  • @Richard86 - String comparison is going to O(n), though, isn't it? – dfb May 16 '11 at 21:04
  • @jack44 The speed could be improved. One solution is to start at pos `strlen($string)/2`. Then, if it is a match, you would go to `strlen($string)/4`, `strlen($string)/8`, and so on, until it is not a match, after that, you would check each char from the current position until match. If it isn't a match at pos `strlen($string)/2`, you would check at pos `strlen($string)-strlen($string)/2`, if not a match, continue with `strlen($string)-strlen($string)/4`, `strlen($string)-strlen($string)/8` and so on until a match, and then, check from one step back. I'll use PHP if no one tries with pseudo. – AndersTornkvist May 16 '11 at 21:06
  • @jack44 I believe the O(n) solution by @spin is to prefer, and there is no need to improve this answer as it will not be able to be as fast as the O(n) solution is. – AndersTornkvist May 16 '11 at 21:13
2

O(n) solution. Assumes that the entire string must be covered. The key observation is that we generate the pattern and test it, but if we find something along the way that doesn't match, we must include the entire string that we already tested, so we don't have to reobserve those characters.

def pattern(inputv):
    pattern_end =0
    for j in range(pattern_end+1,len(inputv)):

        pattern_dex = j%(pattern_end+1)
        if(inputv[pattern_dex] != inputv[j]):

            pattern_end = j;
            continue

        if(j == len(inputv)-1):
            print pattern_end
            return inputv[0:pattern_end+1];
    return inputv;
SamB
  • 9,039
  • 5
  • 49
  • 56
dfb
  • 13,133
  • 2
  • 31
  • 52
  • Is `for pattern_end in range(len(inputv)/2)` necessary? I don't think it is. – Ishtar May 16 '11 at 18:55
  • @Ishtar - sorry I'm not following. Do you mean the look of the len()/2 part – dfb May 16 '11 at 18:56
  • I mean, replacing that line with `pattern_end = 0`. – Ishtar May 16 '11 at 18:59
  • 11
    I'm afraid the algorithm is incorrect. Please consider the input: "BCBDBCBCBDBC". The smallest repeating pattern is "BCBDBC", but the algorithm above will miss it. Also, I think it doesn't deal correctly with the case "HELLOHELL" (where it returns "HELLO" instead of the complete string). – Eyal Schneider Feb 03 '13 at 11:03
  • @EyalSchneider returning "HELLO" for "HELLOHELL" makes perfect sense. Otherwise you could be using the length of the string to determine which patterns are even allowed. You only need to check patterns whose lengths are combinations of the prime factors (and 1) of the length of the input. For example if you get a string that's 101 characters long what's the point of even trying a 5 letter subsequence? It's not going to fit perfectly into 101 characters. If your input's length is 13 there's no point in even checking, no pattern (except a 1 letter pattern) repeated n times can ever add up to 13. – Boris Verkhovskiy Feb 10 '19 at 00:06
  • 2
    @Boris: The problem is finding the smallest sub-sequence of S such that K>=1 repetitions of it would result in S itself. The input "HELLOHELL" has no repeating subsequence with K>1, so "HELLOHELL" should be returned. – Eyal Schneider Feb 10 '19 at 23:01
  • This is a correct answer. not the most voted one. This answer is also work for "aabbccaa" and find "aabbcc" but the top answer doesn't work and print "aabbccaa". – sajjad jafari Nov 14 '22 at 05:03
1

Most easiest one in python:

def pattern(self, s):
    ans=(s+s).find(s,1,-1)
    return len(pat) if ans == -1 else ans
Pranav
  • 11
  • 1
1

Simpler answer which I can come up in an interview is just a O(n^2) solution, which tries out all combinations of substring starting from 0.

int findSmallestUnit(string str){
    for(int i=1;i<str.length();i++){
        int j=0;
        for(;j<str.length();j++){
            if(str[j%i] != str[j]){
                break;
            }
        }
        if(j==str.length()) return str.substr(0,i);
    }
    return str;
}

Now if someone is interested in O(n) solution to this problem in c++:

  int findSmallestUnit(string str){
      vector<int> lps(str.length(),0);
      int i=1;
      int len=0;

      while(i<str.length()){
          if(str[i] == str[len]){
              len++;
              lps[i] = len;
              i++;
          }
          else{
              if(len == 0) i++;
              else{
                  len = lps[len-1];
              }
          }
      }
      int n=str.length();
      int x = lps[n-1];
      if(n%(n-x) == 0){
          return str.substr(0,n-x);    
      }
      return str;
  }

The above is just @Buge's answer in c++, since someone asked in comments.

Manish Madugula
  • 139
  • 1
  • 9
0

I believe there is a very elegant recursive solution. Many of the proposed solutions solve the extra complexity where the string ends with part of the pattern, like abcabca. But I do not think that is asked for.

My solution for the simple version of the problem in clojure:

 (defn find-shortest-repeating [pattern string]
  (if (empty? (str/replace string pattern ""))
   pattern
   (find-shortest-repeating (str pattern (nth string (count pattern))) string)))

(find-shortest-repeating "" "abcabcabc") ;; "abc"

But be aware that this will not find patterns that are uncomplete at the end.

Ward
  • 2,802
  • 1
  • 23
  • 38
0

I found a solution based on your post, that could take an incomplete pattern:

(defn find-shortest-repeating [pattern string]
   (if (or (empty? (clojure.string/split string (re-pattern pattern)))
          (empty? (second (clojure.string/split string (re-pattern pattern)))))
    pattern
    (find-shortest-repeating (str pattern (nth string (count pattern))) string)))
  • @ward `(defn find-pattern-string [string] (let [pattern "" working-str string] (reduce #(if (not (or (empty? (clojure.string/split string (re-pattern %1))) (empty? (second (clojure.string/split string (re-pattern %1)))))) (str %1 %2) %1) pattern working-str)))` – Alfonso Fernando Alvarez Oct 12 '17 at 14:53
0

My Solution: The idea is to find a substring from the position zero such that it becomes equal to the adjacent substring of same length, when such a substring is found return the substring. Please note if no repeating substring is found I am printing the entire input String.

public static void repeatingSubstring(String input){
    for(int i=0;i<input.length();i++){
        if(i==input.length()-1){
            System.out.println("There is no repetition "+input);
        }
        else if(input.length()%(i+1)==0){
            int size = i+1;
            if(input.substring(0, i+1).equals(input.substring(i+1, i+1+size))){
                System.out.println("The subString which repeats itself is "+input.substring(0, i+1));
                break;
            }
        }
    }
}
hdalali
  • 1
  • 1
0

Regex solution:

Use the following regex replacement to find the shortest repeating substring, and only keeping that substring:

^(.+?)\1*$
$1

Explanation:

^(.+?)\1*$
^        $   # Start and end, to match the entire input-string
 (   )       # Capture group 1:
  .+         #  One or more characters,
    ?        #  with a reluctant instead of greedy match†
      \1*    # Followed by the first capture group repeated zero or more times

$1           # Replace the entire input-string with the first capture group match,
             # removing all other duplicated substrings

Greedy vs reluctant would in this case mean: greedy = consumes as many characters as it can; reluctant = consumes as few characters as it can. Since we want the shortest repeating substring, we would want a reluctant match in our regex.

Example input: "abkebabkebabkeb"
Example output: "abkeb"

Try it online in Retina.

Here an example implementation in Java.

Kevin Cruijssen
  • 9,153
  • 9
  • 61
  • 135
  • This won't work for the string `abaa` where expected output is `a` – Himanshu Poddar Jun 28 '22 at 16:55
  • @HimanshuPoddar If I read the question correctly, the full word is always build from multiple repeating substrings without any other characters, so `abab` (2x `ab`) or `aaa` (3x `a`) are both valid inputs, but `abaa` not (unless you count it as 1x `abaa`). – Kevin Cruijssen Jun 28 '22 at 17:08
  • 1
    Oh yeah, my bad I confused between questions, was reading similar question on the same lines. Thanks for the reply tho – Himanshu Poddar Jun 28 '22 at 17:32
0

This is a solution I came up with using the queue, it passed all the test cases of a similar problem in codeforces. Problem No is 745A.

#include<bits/stdc++.h>
using namespace std;
typedef long long ll;

int main()
{
    ios_base::sync_with_stdio(false);
    cin.tie(NULL);

    string s, s1, s2; cin >> s; queue<char> qu; qu.push(s[0]); bool flag = true; int ind = -1;
    s1 = s.substr(0, s.size() / 2);
    s2 = s.substr(s.size() / 2);
    if(s1 == s2)
    {
        for(int i=0; i<s1.size(); i++)
        {
            s += s1[i];
        }
    }
    //cout << s1 << " " << s2 << " " << s << "\n";
    for(int i=1; i<s.size(); i++)
    {
        if(qu.front() == s[i]) {qu.pop();}
        qu.push(s[i]);
    }
    int cycle = qu.size();

    /*queue<char> qu2 = qu; string str = "";
    while(!qu2.empty())
    {
        cout << qu2.front() << " ";
        str += qu2.front();
        qu2.pop();
    }*/


    while(!qu.empty())
    {
        if(s[++ind] != qu.front()) {flag = false; break;}
        qu.pop();
    }
    flag == true ? cout << cycle : cout << s.size();
    return 0;
}

Nguyễn Văn Phong
  • 13,506
  • 17
  • 39
  • 56
-1

Super delayed answer, but I got the question in an interview, here was my answer (probably not the most optimal but it works for strange test cases as well).

private void run(String[] args) throws IOException {
    File file = new File(args[0]);
    BufferedReader buffer = new BufferedReader(new FileReader(file));
    String line;
    while ((line = buffer.readLine()) != null) {
        ArrayList<String> subs = new ArrayList<>();
        String t = line.trim();
        String out = null;
        for (int i = 0; i < t.length(); i++) {
            if (t.substring(0, t.length() - (i + 1)).equals(t.substring(i + 1, t.length()))) {
                subs.add(t.substring(0, t.length() - (i + 1)));
            }
        }
        subs.add(0, t);
        for (int j = subs.size() - 2; j >= 0; j--) {
            String match = subs.get(j);
            int mLength = match.length();
            if (j != 0 && mLength <= t.length() / 2) {
                if (t.substring(mLength, mLength * 2).equals(match)) {
                    out = match;
                    break;
                }
            } else {
                out = match;
            }
        }
        System.out.println(out);
    }
}

Testcases:

abcabcabcabc
bcbcbcbcbcbcbcbcbcbcbcbcbcbc
dddddddddddddddddddd
adcdefg
bcbdbcbcbdbc
hellohell

Code returns:

abc
bc
d
adcdefg
bcbdbc
hellohell

MMP
  • 546
  • 2
  • 5
  • 19
  • 1
    Just looking at the first for loop this is O(n^2), because each .equals() can take n time. – Buge Nov 22 '15 at 19:58
-1

Works in cases such as bcbdbcbcbdbc.

function smallestRepeatingString(sequence){
  var currentRepeat = '';
  var currentRepeatPos = 0;

  for(var i=0, ii=sequence.length; i<ii; i++){
    if(currentRepeat[currentRepeatPos] !== sequence[i]){
      currentRepeatPos = 0;
      // Add next character available to the repeat and reset i so we don't miss any matches inbetween
      currentRepeat = currentRepeat + sequence.slice(currentRepeat.length, currentRepeat.length+1);
      i = currentRepeat.length-1;
    }else{
      currentRepeatPos++;
    }
    if(currentRepeatPos === currentRepeat.length){
      currentRepeatPos = 0;
    }
  }

  // If repeat wasn't reset then we didn't find a full repeat at the end.
  if(currentRepeatPos !== 0){ return sequence; }

  return currentRepeat;
}
davidbuttar
  • 676
  • 4
  • 12
  • 1
    This is actually O(n^2). That is because you reset `i` to be smaller with `i = currentRepeat.length-1;`. So with a 10 character string ling 'aaaaaaaaab' it takes 46 iterations. With a 20 character string it takes 191 iterations. – Buge Nov 22 '15 at 20:22
-1

I came up with a simple solution that works flawlessly even with very large strings.
PHP Implementation:

function get_srs($s){
    $hash = md5( $s );
    $i = 0; $p = '';

    do {
        $p .= $s[$i++];
        preg_match_all( "/{$p}/", $s, $m );
    } while ( ! hash_equals( $hash, md5( implode( '', $m[0] ) ) ) );

    return $p;
}
  • 1
    Would be good if you gave some detail regarding why exactly this works. Providing more detail helps the entire community and helps to get more up votes. – Charlie Fish Sep 15 '16 at 20:48