I know this is a somewhat beaten topic, but I have reached the limit of help I can get from what's already been answered.
This is for the Rosalind project problem LREP. I'm trying to find the longest k-peated substring in a string and I've been provided the suffix tree, which is nice. I know that I need to annotate the suffix table with the number of descendant leaves from each node, then find nodes with >=k
descendants, and finally find the deepest of those nodes. Theory-wise I'm set.
I've gotten a lot of help from the following resources (oops, I can only post 2):
I can get the paths from the root to each leaf, but I can't figure out how to pre-process the tree in such a way that I can get the number of descendants from each node. I have a separate algorithm that works on small sequences but it's in exponential complexity, so for larger stuff it takes way too long. I know with a DFS I should be able to perform the whole task in linear complexity. For this algorithm to work I need to be able to get the longest k-peat of an ~40,000 length string in less than 5 minutes.
Here's some sample data (first line: sequence
, second line: k
, suffix table format: parent child location length
):
CATACATAC$
2
1 2 1 1
1 7 2 1
1 14 3 3
1 17 10 1
2 3 2 4
2 6 10 1
3 4 6 5
3 5 10 1
7 8 3 3
7 11 5 1
8 9 6 5
8 10 10 1
11 12 6 5
11 13 10 1
14 15 6 5
14 16 10 1
The output from this should be CATAC
.
With the following code (modified from LiteratePrograms) I've been able to get the paths, but it still takes a long time on longer sequences to parse out a path for each node.
#authors listed at
#http://en.literateprograms.org/Depth-first_search_(Python)?action=history&offset=20081013235803
class Vertex:
def __init__(self, data):
self.data = data
self.successors = []
def depthFirstSearch(start, isGoal, result):
if start in result:
return False
result.append(start)
if isGoal(start):
return True
for v in start.successors:
if depthFirstSearch(v, isGoal, result):
return True
# No path was found
result.pop()
return False
def lrep(seq,reps,tree):
n = 2 * len(seq) - 1
v = [Vertex(i) for i in xrange(n)]
edges = [(int(x[0]),int(x[1])) for x in tree]
for a, b in edges:
v[a].successors.append(v[b])
paths = {}
for x in v:
result = []
paths[x.data] = []
if depthFirstSearch(v[1], (lambda v: v.data == x.data), result):
path = [u.data for u in result]
paths[x.data] = path
What I'd like to do is pre-process the tree to find nodes which satisfy the descendants >= k
requirement prior to finding the depth. I haven't even gotten to how I'm going to calculate depth yet. Though I imagine I'll have some dictionary to keeps track of the depths of each node in the path then sums.
So, my first-most-important question is: "How do I preprocess the tree with descendant leaves?"
My second-less-important question is: "After that, how can I quickly compute depth?"
P.S. I should state that this isn't homework or anything of that sort. I'm just a biochemist trying to expand my horizons with some computational challenges.