How to find top 10 frequent substring from string databases

Question

Assume I have a txt file, each line represents a string. Is there some efficient way to find out top 10 frequent substrings.

The difficulty is that there are too large size of substring permutation for a given string. Given a N length of string, it has total C(N,0)+C(N,1)+..C(N,N) kinds of substring.

================================================ [Update]

The question is similar with "[a link]Algorithm to find the most common substrings in a string" , but both are NOT the SAME. The difference is that I tried to find out top 10 frequent substring among all strings while it is only to find longest substring in one string in "[a link] Algorithm to find the most common substrings in a string to find the most common substrings in one string" which is only local optimization.

Though one substring is infrequent in all strings via method in "[a link] Algorithm to find the most common substrings in a string", it is possible to be most frequent. For example, I have 10 strings, string most frequent str1 sub_str1 --4 times
str2 sub_str2 -- 4 times ..
str10 sub_str10

The most frequent substring of each string are different and each occurs 4 times. There is possibility that another substring named sub_minor that exists occurs in all string and happens only 1 times. As a result, this sub_minor string is most frequent since it occurs 10 which exceeds all other sub_str string.

All sub_str all only local optimization instead of global optimization and my problem is mainly for global optimization, which is different from "[a link] Algorithm to find the most common substrings in a string"

How to find top 10 frequent substring from string databases

0 Answers0