7

I have a string s and I want to search for the substring of length X that occurs most often in s. Overlapping substrings are allowed.

For example, if s="aoaoa" and X=3, the algorithm should find "aoa" (which appears 2 times in s).

Does an algorithm exist that does this in O(n) time?

The Unfun Cat
  • 29,987
  • 31
  • 114
  • 156
qjkx
  • 73
  • 1
  • 4
  • 1
    you misspelled "lenght" and "exemple". Correct is "length" and "example" – John Scipione Oct 20 '09 at 20:22
  • This is very similar to the problem of finding the [most frequent phrase](https://stackoverflow.com/questions/29753618/find-most-repeated-phrase-on-huge-text) in a text. – Anderson Green Jun 15 '22 at 18:46

9 Answers

8

You can do this using a rolling hash in O(n) time (assuming good hash distribution). A simple rolling hash is the XOR of the characters in the window; you can compute it incrementally from the previous substring's hash using just two XORs. (See the Wikipedia entry for better rolling hashes than XOR.) Compute the hashes of all n-X+1 substrings using the rolling hash in O(n) time. If there were no collisions, the answer is clear; if collisions happen, you'll need to do more work. My brain hurts trying to figure out if that can all be resolved in O(n) time.
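
Here is a minimal sketch of that idea, assuming byte-sized characters and deferring collision handling. XOR ignores character order, so a real implementation would swap in one of the better rolling hashes mentioned, but the sliding update has the same shape:

from collections import defaultdict

def rolling_xor_counts(s, X):
    # hash of the first window: XOR of its characters
    h = 0
    for ch in s[:X]:
        h ^= ord(ch)
    counts = defaultdict(int)
    counts[h] += 1
    # slide the window: remove the oldest character, add the newest
    for i in range(X, len(s)):
        h ^= ord(s[i - X]) ^ ord(s[i])  # just 2 xors per step
        counts[h] += 1
    return counts  # hash -> number of windows with that hash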

Update:

Here's a randomized O(n) algorithm. You can find the top hash in O(n) time by scanning the hashtable (keeping it simple, assume no ties). Find one X-length string with that hash (keep a record in the hashtable, or just redo the rolling hash). Then use an O(n) string searching algorithm to find all occurrences of that string in s. If you find the same number of occurrences as you recorded in the hashtable, you're done.
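
A sketch of that verification step (reusing `rolling_xor_counts` from the sketch above; the naive occurrence scan at the end stands in for a real O(n) string search such as KMP):

def top_substring_or_retry(s, X):
    counts = rolling_xor_counts(s, X)
    top_hash, top_count = max(counts.items(), key=lambda kv: kv[1])
    # redo the rolling hash to recover one window with the top hash
    h = 0
    for ch in s[:X]:
        h ^= ord(ch)
    candidate = s[:X] if h == top_hash else None
    for i in range(X, len(s)):
        h ^= ord(s[i - X]) ^ ord(s[i])
        if candidate is None and h == top_hash:
            candidate = s[i - X + 1:i + 1]
    # count the candidate's true occurrences and compare
    occurrences = sum(s[i:i + X] == candidate for i in range(len(s) - X + 1))
    # mismatch means a collision hid something: rehash and retry
    return candidate if occurrences == top_count else None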

If not, that means you have a hash collision: pick a new random hash function and try again. If your hash function has log(n)+1 bits and is pairwise independent [Prob(h(s) == h(t)) <= 1/2^{log(n)+1} = 1/(2n) if s != t], then by a union bound the probability that the most frequent X-length substring in s has a collision with any of the <=n other length-X substrings of s is at most 1/2. So if there is a collision, pick a new random hash function and retry; you will need only a constant expected number of tries before you succeed.

Now we only need a randomized pairwise independent rolling hash algorithm.

Update2:

Actually, you need 2log(n) bits of hash to avoid all (n choose 2) collisions because any collision may hide the right answer. Still doable, and it looks like hashing by general polynomial division should do the trick.
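
As an illustration of a stronger, re-randomizable rolling hash for the retry scheme, here is a sketch using a Rabin-Karp-style polynomial hash (a cousin of the polynomial-division hashing named above; the modulus and names are my choices):

import random

def poly_rolling_counts(s, X, mod=(1 << 61) - 1):
    base = random.randrange(256, mod - 1)  # re-randomized on each retry
    top = pow(base, X - 1, mod)            # weight of the outgoing character
    h = 0
    for ch in s[:X]:
        h = (h * base + ord(ch)) % mod
    counts = {h: 1}
    for i in range(X, len(s)):
        # drop the oldest character, shift, and append the newest
        h = ((h - ord(s[i - X]) * top) * base + ord(s[i])) % mod
        counts[h] = counts.get(h, 0) + 1
    return counts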

Keith Randall
  • 22,985
  • 2
  • 35
  • 54
  • That's the kind of hash I was thinking of -- thanks for the link. For more brain hurting: You have to assume that collisions happened, and check each of them manually, so now you're at O(n + X*(number of possible collisions)) which may or may not be more than O(n) – Ian Clelland Oct 20 '09 at 22:37
  • I'm confused: why not use the hashmap, hashtable, or hash functions in the Java library, since those are already well built? – user2372074 Sep 29 '14 at 16:07
  • @user2372074: because the standard algorithm for this (hash all ~n substrings of length X) takes O(nX) time. That is worse than this algorithm which is O(n). – Keith Randall Sep 30 '14 at 02:57
  • @Keith Randall Just to be clear: for the standard algorithm, is O(nX) the space complexity? – user2372074 Sep 30 '14 at 03:31
  • @user2372074: no, you can do it in O(n) words, or O(n lg n) bits. – Keith Randall Sep 30 '14 at 04:46
  • A properly designed hash table will handle collisions for you, so you don't even need to rack your brain about it. – qwr Feb 18 '20 at 23:03
4

I don't see an easy way to do this in strictly O(n) time, unless X is fixed and can be considered a constant. If X is a parameter to the algorithm, then most simple ways of doing this will actually be O(n*X), as you will need to do comparison operations, string copies, hashes, etc., on a substring of length X at every iteration.

(I'm imagining, for a minute, that s is a multi-gigabyte string and that X is some number over a million, and I don't see any simple way of doing string comparison, or of hashing substrings of length X, that is O(1) and independent of the size of X.)

It might be possible to avoid string copies during scanning, by leaving everything in place, and to avoid re-hashing the entire substring -- perhaps by using an incremental hash algorithm where you can add a byte at a time, and remove the oldest byte -- but I don't know of any such algorithms that wouldn't result in huge numbers of collisions that would need to be filtered out with an expensive post-processing step.

Update

Keith Randall points out that this kind of hash is known as a rolling hash. It still remains, though, that you would have to store the starting string position for each match in your hash table, and then verify after scanning the string that all of your matches were true. You would need to sort the hashtable, which could contain n-X entries, based on the number of matches found for each hash key, and verify each result -- probably not doable in O(n).

Ian Clelland
  • 43,011
  • 8
  • 86
  • 87
  • @Clelland, why do we need to reinvent the hash function? Can't we just go with built-in hashing? – user2372074 Sep 29 '14 at 18:17
  • Why do you assume there will be many collisions? A proper rolling hash would not have that issue in average case. – qwr Feb 18 '20 at 22:57
  • And why do you need to store the starting position? Do you not trust your hash table to keep counts? A properly designed hash table, whether open addressing or separate chaining, will have constant time insertions. This includes rehashing. – qwr Feb 18 '20 at 23:02
1

It should be O(n*m), where m is the average length of a string in the list. For very small values of m, the algorithm will approach O(n).

  • Build a hashtable of counts for each string length
  • Iterate over your collection of strings, updating the hashtable accordingly, storing the current most prevalent number as an integer variable separate from the hashtable (see the sketch after this list)
  • done.
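
A minimal sketch of those steps in Python (with X fixed, as the comments below note; the names are illustrative):

from collections import defaultdict

def most_prevalent(s, X):
    counts = defaultdict(int)
    best, best_count = None, 0
    for i in range(len(s) - X + 1):
        sub = s[i:i + X]
        counts[sub] += 1  # update the hashtable
        if counts[sub] > best_count:
            # running maximum, kept outside the hashtable
            best, best_count = sub, counts[sub]
    return best, best_count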
Chris Ballance
  • 33,810
  • 26
  • 104
  • 151
  • 2
    Given that there are n*(n-1)/2 substrings of a string of length n, I don't see how your roughly-described algorithm can be O(n) -- your iteration step should be at least O(N^2) – Ian Clelland Oct 20 '09 at 20:26
  • 1
    @Ian Clelland: `X` is fixed, isn't it? – jfs Oct 20 '09 at 20:38
  • 1
    Absolutely right -- I missed that in the problem description. – Ian Clelland Oct 20 '09 at 20:42
  • 'X' is fixed, one iteration = O(n) – Chris Ballance Oct 20 '09 at 20:47
  • 2
    It's still not really O(N) -- At each of the N positions, you need to look at the next M characters (e.g. produce a hash of M characters), so the complexity is O(NM). Treating it as O(N) simply means you're assuming M is so small that its contribution is negligible. – Jerry Coffin Oct 20 '09 at 20:52
  • I think this answer would be a lot better if it included some pseudocode. – Joren Oct 20 '09 at 21:08
  • @Chris: I should add, however, that I think O(NM) is probably as good as anybody can do. Given N characters, you have to look at all N-M possible substrings, and to decide whether two substrings are equal, you have to compare all M characters of those strings. – Jerry Coffin Oct 20 '09 at 21:13
  • @Jerry Coffin: there are sub-linear algorithms for searching substrings e.g., [pdf] http://alexandria.tue.nl/extra1/wskrap/publichtml/200407.pdf – jfs Oct 20 '09 at 21:45
  • Yes, searching for a substring can be sublinear -- but in this case you can't use that, because you're interested in *every* possible substring; you just have to determine which one you're dealing with at each step in the larger string. – Jerry Coffin Oct 21 '09 at 13:27
0

Naive solution in Python

from collections import defaultdict
from operator import itemgetter

def naive(s, X):
    # map each length-X substring to the number of times it occurs in s
    freq = defaultdict(int)
    for i in range(len(s) - X + 1):
        freq[s[i:i+X]] += 1
    # return the (substring, count) pair with the largest count
    return max(freq.items(), key=itemgetter(1))

print(naive("aoaoa", 3))
# -> ('aoa', 2)

In plain English

  1. Create a mapping: substring of length X -> how many times it occurs in the string s

    for i in range(len(s) - X + 1):
        freq[s[i:i+X]] += 1
    
  2. Find a pair in the mapping with the largest second item (frequency)

    max(freq.items(), key=itemgetter(1))
    
jfs
  • 399,953
  • 195
  • 994
  • 1,670
0

Here is a version I did in C. Hope that it helps.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    const char *string = "aoaoa";
    char *maxstring = NULL, *tmpstr = NULL;
    unsigned int n = 3, i, j, matchcount, maxcount = 0;
    unsigned int len = (unsigned int)strlen(string);

    for (i = 0; i <= len - n; i++) {
        tmpstr = malloc(n + 1);
        strncpy(tmpstr, string + i, n);
        tmpstr[n] = '\0';       /* terminator belongs at index n, not n + 1 */
        matchcount = 0;
        for (j = 0; j <= len - n; j++) {
            /* compare in place; no need to copy the second substring */
            if (!strncmp(tmpstr, string + j, n))
                matchcount++;
        }
        if (matchcount > maxcount) {
            free(maxstring);    /* release the previous best match */
            maxstring = tmpstr;
            maxcount = matchcount;
        } else {
            free(tmpstr);       /* avoid leaking the temporary copy */
        }
    }

    printf("max string: \"%s\", count: %u\n", maxstring, maxcount);
    free(maxstring);

    return 0;
}
John Scipione
  • 2,360
  • 4
  • 25
  • 28
0

You can build a tree of sub-strings. The idea is to organise your sub-strings like a telephone book. You then look up the sub-string and increase its count by one.

In your example above, the tree will have sections (nodes) starting with the letters: 'a' and 'o'. 'a' appears three times and 'o' appears twice. So those nodes will have a count of 3 and 2 respectively.

Next, under the 'a' node a sub-node of 'o' will appear corresponding to the sub-string 'ao'. This appears twice. Under the 'o' node 'a' also appears twice.

We carry on in this fashion until we reach the end of the string.

A representation of the tree for 'abac' might be (nodes on the same level are separated by a comma, sub-nodes are in brackets, counts appear after the colon).

a:2(b:1(a:1(c:1())),c:1()),b:1(a:1(c:1())),c:1()

If the tree is drawn out, it will be a lot more obvious! What this all says, for example, is that the string 'aba' appears once, the string 'a' appears twice, etc. But storage is greatly reduced and, more importantly, retrieval is greatly sped up (compare this to keeping a list of sub-strings).

To find out which sub-string is most repeated, do a depth first search of the tree, every time a leaf node is reached, note the count, and keep a track of the highest one.

Building the trie inserts each of the roughly n substrings one letter at a time, so construction takes O(n*X), and looking up any single sub-string is O(X) -- certainly better than the O(n^2) pairwise comparisons of a naive approach.
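
A minimal sketch of the trie idea in Python (the node layout -- a [count, children] pair per letter -- is my choice); instead of a separate depth-first search, it tracks the running maximum while inserting:

def most_frequent_via_trie(s, X):
    root = {}  # char -> [count, children-dict]
    best, best_count = None, 0
    for i in range(len(s) - X + 1):
        children = root
        for ch in s[i:i + X]:  # walk/extend the trie one letter at a time
            node = children.setdefault(ch, [0, {}])
            node[0] += 1
            children = node[1]
        # node now holds the count of this full length-X window
        if node[0] > best_count:
            best, best_count = s[i:i + X], node[0]
    return best, best_count

print(most_frequent_via_trie("aoaoa", 3))
# -> ('aoa', 2)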

Guillermo Phillips
  • 2,176
  • 1
  • 23
  • 40
  • But how exactly do you add all the substrings to your data structure (which appears to be a trie)? – qwr Feb 18 '20 at 23:06
  • By traversing the trie taking one letter at a time from the substring in question. If you reach a leaf node before reaching the end of the substring, then you start adding nodes to the trie, one for each letter. – Guillermo Phillips Feb 18 '20 at 23:20
  • what is the substring in question? unless you use some kind of sliding window – qwr Feb 19 '20 at 02:32
  • You would have to extract all substrings. Start at the first character and fill in the trie one letter at a time until you reach the end of the string. Then move on to the second character, and rinse and repeat. The trie is simply a record keeping mechanism. – Guillermo Phillips Feb 19 '20 at 11:34
0

Python-3 Solution:

from collections import Counter

string = "aoaoa"  # sample input
K = 3             # where K is the substring length

# collect every substring of length K (sliding window; avoid shadowing the built-in `list`)
substrings = [string[i:i + K] for i in range(len(string) - K + 1)]
# now find the most common value in this list
# you can do this natively, but I prefer using collections
most_frequent = Counter(substrings).most_common(1)[0][0]
print(most_frequent)

Here is the native way to get the most common (for those that are interested):

most_occurrences = 0
current_most = ""
for sub in substrings:
    frequency = substrings.count(sub)
    if frequency > most_occurrences:
        most_occurrences = frequency
        current_most = sub
print(f"{current_most}, Occurrences: {most_occurrences}")
[Extract K length substrings (geeks for geeks)][1]


  [1]: https://www.geeksforgeeks.org/python-extract-k-length-substrings/
alien_jedi
  • 306
  • 3
  • 11
-1

The LZW algorithm does this

This is exactly what the Lempel-Ziv-Welch (LZW, used in the GIF image format) compression algorithm does. It finds prevalent repeated byte sequences and replaces them with something short.

LZW on Wikipedia
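
For reference, here is a minimal byte-oriented LZW encoder sketch (the function name is illustrative); repeated substrings earn dictionary entries as they recur, which is the principle being pointed at:

def lzw_encode(data):
    # dictionary starts with all single bytes and grows with each new phrase
    table = {bytes([b]): b for b in range(256)}
    w, out = b"", []
    for b in data:
        wb = w + bytes([b])
        if wb in table:
            w = wb                    # keep extending the current phrase
        else:
            out.append(table[w])      # emit the code for the known phrase
            table[wb] = len(table)    # record the new, longer phrase
            w = bytes([b])
    if w:
        out.append(table[w])
    return out

print(lzw_encode(b"aoaoa"))
# -> [97, 111, 256, 97]  (the repeated "ao" became the single code 256)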

Robert Koritnik
  • 103,639
  • 52
  • 277
  • 404
  • I believe LZW is O(n) to decode, but slower than that on the encode, which is what the OP was looking for. – Dean J Oct 21 '09 at 14:54
  • LZW itself does, yes, because it looks for repeated substrings of all lengths. I was just saying that he could use a similar principle, but looking for fixed-length strings. – Robert Koritnik Oct 21 '09 at 15:42
-2

There's no way to do this in O(n).

Feel free to downvote me if you can prove me wrong on this one, but I've got nothing.

Dean J
  • 39,360
  • 16
  • 67
  • 93