Assume I have a txt file, each line represents a string. Is there some efficient way to find out top 10 frequent substrings.
The difficulty is that there are too large size of substring permutation for a given string. Given a N
length of string, it has total C(N,0)+C(N,1)+..C(N,N)
kinds of substring.
================================================ [Update]
The question is similar with "[a link]Algorithm to find the most common substrings in a string" , but both are NOT the SAME. The difference is that I tried to find out top 10 frequent substring among all strings while it is only to find longest substring in one string in "[a link] Algorithm to find the most common substrings in a string to find the most common substrings in one string" which is only local optimization.
Though one substring is infrequent in all strings via method in "[a link] Algorithm to find the most common substrings in a string", it is possible to be most frequent. For example, I have 10 strings,
string most frequent
str1 sub_str1 --4 times
str2 sub_str2 -- 4 times
..
str10 sub_str10
The most frequent substring of each string are different and each occurs 4 times. There is possibility that another substring named sub_minor that exists occurs in all string and happens only 1 times. As a result, this sub_minor string is most frequent since it occurs 10 which exceeds all other sub_str string.
All sub_str all only local optimization instead of global optimization and my problem is mainly for global optimization, which is different from "[a link] Algorithm to find the most common substrings in a string"