Algorithm to find the most common substrings in a string

Question

Is there any algorithm that can be used to find the most common phrases (or substrings) in a string? For example, the following string would have "hello world" as its most common two-word phrase:

"hello world this is hello world. hello world repeats three times in this string!"

In the string above, the most common string (after the empty string character, which repeats an infinite number of times) would be the space character .

Is there any way to generate a list of common substrings in this string, from most common to least common?

Define what do you mean by phrase, the substring `"l"` is more common then `"hello world"`. And obviously `"hello"` is at least as common as `"hello world"`. — amit, Feb 03 '13 at 08:17
Then the most common substring is the empty string (repeats infinite number of times). The second after that is the most common character. Finding it can be easily done using a [histogram](http://en.wikipedia.org/wiki/Histogram) in `O(n)`. — amit, Feb 03 '13 at 08:18
@AndersonGreen: I don't think you really understood amit's comment. You write, "In the string above, the most common string would be `hello world`"; but the substring `hello world` occurs only three times, whereas the substring `l` occurs nine times. (And `is` occurs four times. And `" "` occurs fifteen times.) — ruakh, Feb 03 '13 at 08:21
@amit After finding the most common character, I would try to find the most common two-character strings that began with each character. Then I would try to find the most common three-character strings that began with the most common two-character strings, and so on. — Anderson Green, Feb 03 '13 at 08:21
"" is the most common substring, it repeats infinite number of times. — amit, Feb 03 '13 at 08:23
http://en.wikipedia.org/wiki/Suffix_tree#Functionality and http://en.wikipedia.org/wiki/Suffix_array if you consider big string at 10000 or 100000 scale. — nhahtdh, Feb 03 '13 at 09:03
"hello world" answer suggests that you want [to find the longest duplicate substring](http://stackoverflow.com/questions/13560037/effcient-way-to-find-longest-duplicate-string-for-python-from-programming-pearl) — jfs, Feb 03 '13 at 09:14
@J.F.Sebastian then does it mean that the longest string which appears twice? What about this example `"AAAAAAAAAAAAAAAAAAAAAAA hello world AAAAAAAAAAAAAAAAAAAAAAA hello world hello world"`. Here the most occurring substring is 'A' and the longest substring (occurring at least twice) is `"AAAAAAAAAAAAAAAAAAAAAAA"`. What should my function return? — ShuklaSannidhya, Feb 03 '13 at 14:25
@SandyLee_user53167: [it returns `"AA..."`](http://ideone.com/FoJm2K) that is correct **if** the answer to the initial question is `"hello world"` and **not** [`"l"` that is the most common substring after `" "`](http://ideone.com/bZIOpW). — jfs, Feb 03 '13 at 14:44

Audrius Meškauskas · Accepted Answer · 2013-02-03T15:00:21.203

This is as task similar to Nussinov algorithm and actually even simpler as we do not allow any gaps, insertions or mismatches in the alignment.

For the string A having the length N, define a F[-1 .. N, -1 .. N] table and fill in using the following rules:

  for i = 0 to N
    for j = 0 to N
      if i != j
        {
          if A[i] == A[j]
            F[i,j] = F [i-1,j-1] + 1;
          else
            F[i,j] = 0;
        }

For instance, for B A O B A B:

AlgChart

This runs in O(n^2) time. The largest values in the table now point to the end positions of the longest self-matching subquences (i - the end of one occurence, j - another). In the beginning, the array is assumed to be zero-initialized. I have added condition to exclude the diagonal that is the longest but probably not interesting self-match.

Thinking more, this table is symmetric over diagonal so it is enough to compute only half of it. Also, the array is zero initialized so assigning zero is redundant. That remains

  for i = 0 to N
    for j = i + 1 to N
      if A[i] == A[j]
         F[i,j] = F [i-1,j-1] + 1;

Shorter but potentially more difficult to understand. The computed table contains all matches, short and long. You can add further filtering as you need.

On the next step, you need to recover strings, following from the non zero cells up and left by diagonal. During this step is also trivial to use some hashmap to count the number of self-similarity matches for the same string. With normal string and normal minimal length only small number of table cells will be processed through this map.

I think that using hashmap directly actually requires O(n^3) as the key strings at the end of access must be compared somehow for equality. This comparison is probably O(n).

Not sure this answers the question. This is an easy method to find the longest self-matching substrings. The question is for the most common self-matching substring. — G. Bach, Feb 03 '13 at 14:08
I have added explanation this can be easily done during the string recovery stage. The algorithm is efficient if only strings above the certain threshold are interesting. — Audrius Meškauskas, Feb 03 '13 at 14:30
Note that you can compute the matrix without actually allocating any extra memory by iterating down through diagonals (instead of rows or cols) and do any analysis of the resulting numbers on the fly without saving them. This will save you O(n^2) of memory. — NightElfik, Jul 07 '20 at 22:02
There are several [sequential pattern mining](https://en.wikipedia.org/wiki/Sequential_pattern_mining) algorithms that can be used to find frequent substrings in a string. — Anderson Green, Feb 14 '22 at 06:39
You can also use a [rolling hash function](https://stackoverflow.com/a/1597556/975097) to find the most common substring with a specific length. — Anderson Green, Apr 10 '22 at 16:49

Jim Gray · Answer 2 · 2013-02-03T08:53:35.110

6

Python. This is somewhat quick and dirty, with the data structures doing most of the lifting.

from collections import Counter
accumulator = Counter()
text = 'hello world this is hello world.'
for length in range(1,len(text)+1):
    for start in range(len(text) - length):
        accumulator[text[start:start+length]] += 1

The Counter structure is a hash-backed dictionary designed for counting how many times you've seen something. Adding to a nonexistent key will create it, while retrieving a nonexistent key will give you zero instead of an error. So all you have to do is iterate over all the substrings.

edited Feb 03 '13 at 08:53

answered Feb 03 '13 at 08:35

Jim Gray

675
5
5

You can use `for start in range(len(text) - length)` and kill the `if`. – Jon Purdy Feb 03 '13 at 08:37
True. Saves some ops as well. – Jim Gray Feb 03 '13 at 08:38
1

to generate the list, call: `accumulator.most_common()` – jfs Feb 03 '13 at 08:58
1

Probably because OP was looking for an algorithm, rather than an implementation. If I posted "I need an algorithm for a sort thats O(n log(n)) but doesn't degrade on mostly-sorted data, I'd rather see "Check out http://en.wikipedia.org/Timsort" than "Java's sort is already optimized for that use case."--even though it is Timsort-based. – Tim Lesher Sep 27 '13 at 12:55

score 2 · Answer 3 · answered Feb 04 '13 at 15:44

just pseudo code, and maybe this isn't the most beautiful solution, but I would solve like this:

function separateWords(String incomingString) returns StringArray{
  //Code
}

function findMax(Map map) returns String{
  //Code
}

function mainAlgorithm(String incomingString) returns String{
    StringArray sArr = separateWords(incomingString);
    Map<String, Integer> map; //init with no content
    for(word: sArr){
        Integer count = map.get(word);
        if(count == null){
            map.put(word,1);
        } else {
            //remove if neccessary
            map.put(word,count++); 
        }
   }
   return findMax(map);
}

Where map can contain a key, value pairs like in Java HashMap.

A more efficient solution would use [a suffix tree](https://stackoverflow.com/questions/37499968/finding-all-repeated-substrings-in-a-string-and-how-often-they-appear) to find the most frequent substrings. — Anderson Green, Feb 15 '22 at 04:00

score 2 · Answer 4 · answered Mar 06 '18 at 10:19

Since for every substring of a String of length >= 2 the text contains at least one substring of length 2 at least as many times, we only need to investigate substrings of length 2.

val s = "hello world this is hello world. hello world repeats three times in this string!"

val li = s.sliding (2, 1).toList
// li: List[String] = List(he, el, ll, lo, "o ", " w", wo, or, rl, ld, "d ", " t", th, hi, is, "s ", " i", is, "s ", " h", he, el, ll, lo, "o ", " w", wo, or, rl, ld, d., ". ", " h", he, el, ll, lo, "o ", " w", wo, or, rl, ld, "d ", " r", re, ep, pe, ea, at, ts, "s ", " t", th, hr, re, ee, "e ", " t", ti, im, me, es, "s ", " i", in, "n ", " t", th, hi, is, "s ", " s", st, tr, ri, in, ng, g!)

val uniques = li.toSet
uniques.toList.map (u => li.count (_ == u))
// res18: List[Int] = List(1, 2, 1, 1, 3, 1, 5, 1, 1, 3, 1, 1, 3, 2, 1, 3, 1, 3, 2, 3, 1, 1, 1, 1, 1, 3, 1, 3, 3, 1, 3, 1, 1, 1, 3, 3, 2, 4, 1, 2, 2, 1)

uniques.toList(6)
res19: String = "s "

What language is this program written in? – Anderson Green Jul 07 '22 at 21:05 — Anderson Green, Jul 07 '22 at 21:05

Déjà vu · Answer 5 · 2013-02-03T13:07:43.270

Perl, O(n²) solution

my $str = "hello world this is hello world. hello world repeats three times in this string!";

my @words = split(/[^a-z]+/i, $str);
my ($display,$ix,$i,%ocur) = 10;

# calculate

for ($ix=0 ; $ix<=$#words ; $ix++) {
  for ($i=$ix ; $i<=$#words ; $i++) {
    $ocur{ join(':', @words[$ix .. $i]) }++;
  }
}

# display 

foreach (sort { my $c = $ocur{$b} <=> $ocur{$a} ; return $c ? $c : split(/:/,$b)-split(/:/,$a); } keys %ocur) {
  print "$_: $ocur{$_}\n";
  last if !--$display;
}

displays the 10 best scores of the most common sub strings (in case of tie, show the longest chain of words first). Change $display to 1 to have only the result.
There are n(n+1)/2 iterations.

Algorithm to find the most common substrings in a string

5 Answers5

Linked

Related