I am given a string of large length (say, 100,000) and an integer k, and I have to compute length of the largest sub-string which repeats in the given string atleast k times.
I found answers to this particular question here and here, but I wanted to know if there is any other efficient method to solve this question other than suffix trees?

- 210
- 2
- 7
-
1It's less efficient, but easier to understand than suffix trees: do a binary search over answer length, compute all polynomial hashes (https://cp-algorithms.com/string/string-hashing.html) of substrings of such length and see if there's a hash which has _>= k_ occurrences. Computing hashes of all substrings of a fixed length is _O(n)_, binary search has _log n_ iterations, so overall complexity is _O(n log n)_. – fas Apr 15 '20 at 18:33
-
@Fas When doing binary search, how do you move left or right? – nice_dev Apr 15 '20 at 20:15
-
@vivek_23 depending on is there's a hash with _>= k_ occurrences or not. That works, because if there's a string `s` of with _k_ occurrences, `s[:-1]` also has at least _k_ occurrences. – fas Apr 15 '20 at 21:02
-
@fas But how to decide to go left or right? It is quite possible that longer string has >= k occurrences? – nice_dev Apr 15 '20 at 21:04
-
@vivek_23 If we check strings of length _x_ and there's a hash with _>= k_ occurrences, we go left to check if there's a string with length _<= x_ with _>= k_ occurrences, otherwise go right. During binary search we just try to find a maximum number _x_, such as exists string `s` with _>= k_ occurrences and length _x_, sure then exists a string of length _x - 1_ (`s[:-1]` for example), _x - 2_ (`s[:-2]`), ..., 1 (just `s[0]` for example). Because binary search finds maximum, it's impossible that string with length _> x_ has _>= k_ occurrences. – fas Apr 15 '20 at 21:13
-
@fas Not sure if I understand correctly, does `abcabcdefgdefgdefgdefg` for example with k = 3, work with your approach? IMO, the answer for this should be `defg`. – nice_dev Apr 15 '20 at 21:23
-
@ShreyTripathi Do you have link to the problem? – nice_dev Apr 15 '20 at 21:25
-
Also, no response on your previous questions! – nice_dev Apr 15 '20 at 21:26
-
@vivek_23 no, not `defg` in this example, this approach doesn't disallow strings to be intersected, with given approach answer seems to be `defgdefg`. I don't think suffix trees can disallow strings not to be intersected too. – fas Apr 15 '20 at 21:30
-
@fas `defgdefg` comes only twice but k is 3. – nice_dev Apr 15 '20 at 21:36
-
@vivek_23 `defgdefg` comes 3 times: `abcabc<1>defg<2>defg1><3>defg2>defg3>` – fas Apr 15 '20 at 21:38
-
@fas Ahh, yes they overlap also. Here, I start search between 1 and 22, mid is 11. No substring of length 11 seems to satisfy. So, we again go right? – nice_dev Apr 15 '20 at 21:52
-
@vivek_23 Left, search between 1 and 10, mid 5, `defgd` exists for example, right, search between 5 and 10, mid 7, `defgdef` exists for example, right, search between 7 and 10, mid 8, `defgdefg` exists, right, search between 8 and 10, mid 9, no such string exists, left, search between 8 and 8, so this is 8 :) – fas Apr 15 '20 at 21:57
-
@vivek_23, I am really sorry for the late response! I was just studying about suffix trees and came across this problem, and wondered if there was another algorithm to solve this problem. I don't have any link to the problem :( – Shrey Tripathi Apr 16 '20 at 04:52
-
@fas Makes sense but I don't think collecting substrings of length say `m` is an O(1) operation. – nice_dev Apr 16 '20 at 05:57
-
@ShreyTripathi Ok, that's fine. I am pondering on previous questions you asked here on SO. Accept the solution that suits the best(it's ok if you don't accept mine, I just want you to respond) – nice_dev Apr 16 '20 at 05:58
-
1@vivek_23 We don't collect substrings, we collect their polynomial hashes (what is it and how it works you can read here https://cp-algorithms.com/string/string-hashing.html) and this can be done in `O(n)` for all substrings of a fixed by binary search length `m` – fas Apr 16 '20 at 06:47
-
@fas, I didn't quite understand your solution. I can perform binary search only if I know the length of the required substring. I don't know that(I have to find that out); or are you implying that I start taking lengths from 1, 2,....len(string) and create a separate hash table for each one of them? Really sorry for the late reply – Shrey Tripathi Apr 16 '20 at 08:44
-
@fas nice info, I will look into it. – nice_dev Apr 16 '20 at 16:11
-
@ShreyTripathi we're doing binary search to find maximum length (see example for string `abcabcdefgdefgdefgdefg` here https://stackoverflow.com/questions/61235589/longest-substring-repeating-atleast-k-times?noredirect=1#comment108336869_61235589 and in other comments above), we're creating separate hash table during binary search for each `mid` value (which is length we're testing if it's possible to find `k` substrings of such length) in it – fas Apr 16 '20 at 22:31
-
@vivek_23 I wrote a long answer to sum up our discussion, which I hope explains everything. https://stackoverflow.com/a/61261931/3365922 – fas Apr 16 '20 at 23:46
1 Answers
There was large discussion in comments, I think it's better write an answer to sum up. TL;DR Longest substring repeating atleast k times
There is less efficient method, but it's really easier to understand than suffix trees: all you need to know is polynomial hashing and binary search.
1. String polynomial hash
Read about it here https://cp-algorithms.com/string/string-hashing.html. Below is short description of this technique.
Let's say we have string s
, integers p
and mod
. Now we can define hash function:
hash(s) = (ord(s[0])*p^(len(s)-1) + ord(s[1])*p^(len(s)-2) + ... + ord(s[len(s)-1])*p^0) % mod
where ord
is a function returning an integer by character (let's say it's an ASCII-code of a character). Polynomial hash can be easily calculated for each prefix of string in O(len(s)):
# h[i] is a hash of prefix of length i.
# For example s = "abacaba",
# h[0] = hash("") = 0
# h[1] = hash("a")
# h[2] = hash("ab")
# ...
# h[7] = hash("abacaba")
h[0] = 0
for i in 1..n:
h[i] = (h[i-1] * p + ord(s[i-1])) % mod
Also let's precalculate p^0 % mod, p^1 % mod, ..., p^len(s) % mod in array pow
:
# pow[i] is (p^i) % mod
pow[0] = 1
for i in 1..n:
pow[i] = (pow[i-1] * p) % mod
Using arrays h
and pow
we can easily calculate hash of any substring of the string s
:
# get_substring_hash returns hash(s[l] + s[l+1] + ... + s[r-1]).
def get_substring_hash(s, l, r):
value = h[r] - h[l]*pow[r-l] # line a
return (value%mod + mod) % mod # line b
Let's understand why code above works.
h[r] = (ord(s[r-1])*p^0 + ord(s[r-2])*p^1 + ... + ord(s[l-1])*p^(r-l) + ord(s[l-2])*p^(r-l+1) + ...) % mod
h[l] = ( ord(s[l-1])*p^0 + ord(s[l-2])*p^1 + ...) % mod
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
As you can see we need only ^^^
-part from h[r]
so we have to get rid of ~~~
-part. ~~~
-part in h[r]
p^(r-l)
times larger than in h[l]
and this explains line a.
Line b is kinda magic when operating with % mod
, value
after line a can be negative, so value%mod + mod
definitely makes is positive. At the same time if value
was positive after line a value%mod + mod
is larger than mod
, so (value%mod + mod) % mod
will definitely return value in range 0, 1, ..., mod-1.
Finally, mod
is a large prime number like 10^9+7 and value
is a small number, but larger than any possible ASCII-code like 239 (read in article why so).
Some notes:
- Hashes may collide, because we have only
mod
possible values of hash but number of strings is infinite. How to deal with it read in article. - Doing operations like
h[r] - h[l]*pow[r-l]
may require using of 64-bit types of integers.
2. Binary search
Just read about it on Wikipedia, there's nothing specific https://en.wikipedia.org/wiki/Binary_search_algorithm.
3. Longest substring repeating at least k times solution
Let's say we precalculated arrays h
and pow
. Let's do binary search to find maximum length ans
of string such that there are k
or more such substrings in the given string s
.
Why binary search works here? Because if there's some length x
such as there are k
or more equal substrings in s
of length x
, then there's definitely k
or more equal substrings in s
of length x-1
(just remove last letter from our matches).
How will binary search work? Let's say we're currently testing if there are k
or more equal substrings of length mid
. We're going to calculate all hashes of length mid
(using get_substring_hash
) and we'll make a decision of changing borders of binary search if there is a k
equal hashes or not.
For example: s = "abcabcdefgdefgdefgdefg", k = 3. Answer is "defgdefg":
abcabcdefgdefgdefgdefg
^^^^^^^^ occurence 1
^^^^^^^^ occurence 2
^^^^^^^^ occurence 3
Binary search iterations:
l = 1, r = 22, mid = 11. No substring of length 11 satisfy.
l = 1, r = 10, mid = 5. There should be hash("defgd") be seen 3 times.
l = 5, r = 10, mid = 7. There should be hash("defgdef") be seen 3 times.
l = 7, r = 10, mid = 8. There should be hash("defgdefg") be seen 3 times.
l = 8, r = 10, mid = 9. No substring of length 9 satisfy.
l = 8, r = 8. That means answer is 8.
As you can see, complexity is O(n log n): round(log n) binary search iterations and O(n) complexity per iteration if you use something like std::unordered_map
to check if there's a hash with >= k occurrences or not.
I really hope everything is clear now.

- 1,393
- 10
- 20
-
Thank you very much for your answer. I understood the algorithm. Just one thing, what is 'p' here? It can be any number, right? – Shrey Tripathi Apr 17 '20 at 06:44
-
It definitely can be a prime number larger than any number your function `ord` returns (larger than any ASCII-code). I'm not able to say for sure that `p` can be not a prime, seems it would work too – fas Apr 17 '20 at 09:33