0

I am given a string of large length (say, 100,000) and an integer k, and I have to compute length of the largest sub-string which repeats in the given string atleast k times.
I found answers to this particular question here and here, but I wanted to know if there is any other efficient method to solve this question other than suffix trees?

Shrey Tripathi
  • 210
  • 2
  • 7
  • 1
    It's less efficient, but easier to understand than suffix trees: do a binary search over answer length, compute all polynomial hashes (https://cp-algorithms.com/string/string-hashing.html) of substrings of such length and see if there's a hash which has _>= k_ occurrences. Computing hashes of all substrings of a fixed length is _O(n)_, binary search has _log n_ iterations, so overall complexity is _O(n log n)_. – fas Apr 15 '20 at 18:33
  • @Fas When doing binary search, how do you move left or right? – nice_dev Apr 15 '20 at 20:15
  • @vivek_23 depending on is there's a hash with _>= k_ occurrences or not. That works, because if there's a string `s` of with _k_ occurrences, `s[:-1]` also has at least _k_ occurrences. – fas Apr 15 '20 at 21:02
  • @fas But how to decide to go left or right? It is quite possible that longer string has >= k occurrences? – nice_dev Apr 15 '20 at 21:04
  • @vivek_23 If we check strings of length _x_ and there's a hash with _>= k_ occurrences, we go left to check if there's a string with length _<= x_ with _>= k_ occurrences, otherwise go right. During binary search we just try to find a maximum number _x_, such as exists string `s` with _>= k_ occurrences and length _x_, sure then exists a string of length _x - 1_ (`s[:-1]` for example), _x - 2_ (`s[:-2]`), ..., 1 (just `s[0]` for example). Because binary search finds maximum, it's impossible that string with length _> x_ has _>= k_ occurrences. – fas Apr 15 '20 at 21:13
  • @fas Not sure if I understand correctly, does `abcabcdefgdefgdefgdefg` for example with k = 3, work with your approach? IMO, the answer for this should be `defg`. – nice_dev Apr 15 '20 at 21:23
  • @ShreyTripathi Do you have link to the problem? – nice_dev Apr 15 '20 at 21:25
  • Also, no response on your previous questions! – nice_dev Apr 15 '20 at 21:26
  • @vivek_23 no, not `defg` in this example, this approach doesn't disallow strings to be intersected, with given approach answer seems to be `defgdefg`. I don't think suffix trees can disallow strings not to be intersected too. – fas Apr 15 '20 at 21:30
  • @fas `defgdefg` comes only twice but k is 3. – nice_dev Apr 15 '20 at 21:36
  • @vivek_23 `defgdefg` comes 3 times: `abcabc<1>defg<2>defg1><3>defg2>defg3>` – fas Apr 15 '20 at 21:38
  • @fas Ahh, yes they overlap also. Here, I start search between 1 and 22, mid is 11. No substring of length 11 seems to satisfy. So, we again go right? – nice_dev Apr 15 '20 at 21:52
  • @vivek_23 Left, search between 1 and 10, mid 5, `defgd` exists for example, right, search between 5 and 10, mid 7, `defgdef` exists for example, right, search between 7 and 10, mid 8, `defgdefg` exists, right, search between 8 and 10, mid 9, no such string exists, left, search between 8 and 8, so this is 8 :) – fas Apr 15 '20 at 21:57
  • @vivek_23, I am really sorry for the late response! I was just studying about suffix trees and came across this problem, and wondered if there was another algorithm to solve this problem. I don't have any link to the problem :( – Shrey Tripathi Apr 16 '20 at 04:52
  • @fas Makes sense but I don't think collecting substrings of length say `m` is an O(1) operation. – nice_dev Apr 16 '20 at 05:57
  • @ShreyTripathi Ok, that's fine. I am pondering on previous questions you asked here on SO. Accept the solution that suits the best(it's ok if you don't accept mine, I just want you to respond) – nice_dev Apr 16 '20 at 05:58
  • 1
    @vivek_23 We don't collect substrings, we collect their polynomial hashes (what is it and how it works you can read here https://cp-algorithms.com/string/string-hashing.html) and this can be done in `O(n)` for all substrings of a fixed by binary search length `m` – fas Apr 16 '20 at 06:47
  • @fas, I didn't quite understand your solution. I can perform binary search only if I know the length of the required substring. I don't know that(I have to find that out); or are you implying that I start taking lengths from 1, 2,....len(string) and create a separate hash table for each one of them? Really sorry for the late reply – Shrey Tripathi Apr 16 '20 at 08:44
  • @fas nice info, I will look into it. – nice_dev Apr 16 '20 at 16:11
  • @ShreyTripathi we're doing binary search to find maximum length (see example for string `abcabcdefgdefgdefgdefg` here https://stackoverflow.com/questions/61235589/longest-substring-repeating-atleast-k-times?noredirect=1#comment108336869_61235589 and in other comments above), we're creating separate hash table during binary search for each `mid` value (which is length we're testing if it's possible to find `k` substrings of such length) in it – fas Apr 16 '20 at 22:31
  • @vivek_23 I wrote a long answer to sum up our discussion, which I hope explains everything. https://stackoverflow.com/a/61261931/3365922 – fas Apr 16 '20 at 23:46

1 Answers1

1

There was large discussion in comments, I think it's better write an answer to sum up. TL;DR Longest substring repeating atleast k times

There is less efficient method, but it's really easier to understand than suffix trees: all you need to know is polynomial hashing and binary search.

1. String polynomial hash

Read about it here https://cp-algorithms.com/string/string-hashing.html. Below is short description of this technique.

Let's say we have string s, integers p and mod. Now we can define hash function:

hash(s) = (ord(s[0])*p^(len(s)-1) + ord(s[1])*p^(len(s)-2) + ... + ord(s[len(s)-1])*p^0) % mod 

where ord is a function returning an integer by character (let's say it's an ASCII-code of a character). Polynomial hash can be easily calculated for each prefix of string in O(len(s)):

# h[i] is a hash of prefix of length i.
# For example s = "abacaba",
# h[0] = hash("") = 0
# h[1] = hash("a")
# h[2] = hash("ab")
# ...
# h[7] = hash("abacaba")

h[0] = 0
for i in 1..n:
    h[i] = (h[i-1] * p + ord(s[i-1])) % mod

Also let's precalculate p^0 % mod, p^1 % mod, ..., p^len(s) % mod in array pow:

# pow[i] is (p^i) % mod
pow[0] = 1
for i in 1..n:
    pow[i] = (pow[i-1] * p) % mod

Using arrays h and pow we can easily calculate hash of any substring of the string s:

# get_substring_hash returns hash(s[l] + s[l+1] + ... + s[r-1]).
def get_substring_hash(s, l, r):
    value = h[r] - h[l]*pow[r-l]    # line a
    return (value%mod + mod) % mod  # line b

Let's understand why code above works.

h[r] = (ord(s[r-1])*p^0 + ord(s[r-2])*p^1 + ... + ord(s[l-1])*p^(r-l) + ord(s[l-2])*p^(r-l+1) + ...) % mod
h[l] = (                                          ord(s[l-1])*p^0     + ord(s[l-2])*p^1       + ...) % mod
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As you can see we need only ^^^-part from h[r] so we have to get rid of ~~~-part. ~~~-part in h[r] p^(r-l) times larger than in h[l] and this explains line a.

Line b is kinda magic when operating with % mod, value after line a can be negative, so value%mod + mod definitely makes is positive. At the same time if value was positive after line a value%mod + mod is larger than mod, so (value%mod + mod) % mod will definitely return value in range 0, 1, ..., mod-1.

Finally, mod is a large prime number like 10^9+7 and value is a small number, but larger than any possible ASCII-code like 239 (read in article why so).

Some notes:

  • Hashes may collide, because we have only mod possible values of hash but number of strings is infinite. How to deal with it read in article.
  • Doing operations like h[r] - h[l]*pow[r-l] may require using of 64-bit types of integers.

2. Binary search

Just read about it on Wikipedia, there's nothing specific https://en.wikipedia.org/wiki/Binary_search_algorithm.

3. Longest substring repeating at least k times solution

Let's say we precalculated arrays h and pow. Let's do binary search to find maximum length ans of string such that there are k or more such substrings in the given string s.

Why binary search works here? Because if there's some length x such as there are k or more equal substrings in s of length x, then there's definitely k or more equal substrings in s of length x-1 (just remove last letter from our matches).

How will binary search work? Let's say we're currently testing if there are k or more equal substrings of length mid. We're going to calculate all hashes of length mid (using get_substring_hash) and we'll make a decision of changing borders of binary search if there is a k equal hashes or not.

For example: s = "abcabcdefgdefgdefgdefg", k = 3. Answer is "defgdefg":

abcabcdefgdefgdefgdefg
      ^^^^^^^^          occurence 1
          ^^^^^^^^      occurence 2
              ^^^^^^^^  occurence 3

Binary search iterations:

l =  1, r = 22, mid = 11. No substring of length 11 satisfy.
l =  1, r = 10, mid =  5. There should be hash("defgd")    be seen 3 times.
l =  5, r = 10, mid =  7. There should be hash("defgdef")  be seen 3 times.
l =  7, r = 10, mid =  8. There should be hash("defgdefg") be seen 3 times.
l =  8, r = 10, mid =  9. No substring of length 9  satisfy.
l =  8, r =  8.           That means answer is 8.

As you can see, complexity is O(n log n): round(log n) binary search iterations and O(n) complexity per iteration if you use something like std::unordered_map to check if there's a hash with >= k occurrences or not.

I really hope everything is clear now.

fas
  • 1,393
  • 10
  • 20
  • Thank you very much for your answer. I understood the algorithm. Just one thing, what is 'p' here? It can be any number, right? – Shrey Tripathi Apr 17 '20 at 06:44
  • It definitely can be a prime number larger than any number your function `ord` returns (larger than any ASCII-code). I'm not able to say for sure that `p` can be not a prime, seems it would work too – fas Apr 17 '20 at 09:33