I am trying to solve a pattern mining problem for strings and I think that suffix trees or arrays might be a good option to solve this problem.
I will quickly outline the problem:
I have a set strings of different lengths (quotation are just to mark repetitions for the explanation):
- C"BECB"ECCECCECEEB"BECB"FCCECCECCECCECCFCBFCBFCC
- DCBBDCDDCCECBCEC"BECB""BECB"BECB"BECB"BECB"EDB"BECB""BECB"BEC
- etc.
I now would like to find repeated patterns within each string and repeated patterns that are common between the strings. A repeated pattern in string (1) would be "BECB". Moreover the pattern "BECB" is also repeated in string (2). Of course there are several criteria that need to be decided on as for example the minimum number of repetitions or the minimum number of symbols in a pattern.
From the book by Dan Gusfield "Algorithms on Strings, Trees and Sequences" I know that it is possible to find such repeats (e.g. maximal pairs, maximal repetitive structures etc.) using certain algorithms and a suffix tree data structure. This comes in handy as I would like to use probabilistic suffix trees to also calculate some predictions on these sequences. (But this is not the topic of this post.)
Unfortunately I can't find any implementations of these algorithms. Hence, I am wondering if I am even on the right path using suffix trees to solve the mentioned problem. It seems very strange to me that for such a well described problem no packages are available (in R or Python for example).
Are there any alternatives (with existing packages) that solve my problem?
Or do you know any implementation of algorithms for suffix trees?