Problem:
I need all the sequences of characters that meet the following:
- Sequence of characters must be present more than once ((LE, 1) is thus invalid).
- Sequence of characters must be longer than one character ((M, 2) is thus invalid).
- Sequence of characters must not be part of a longer existing sequence that is present the same number of times ((LI, 2) is thus invalid if (LIO, 2) is present).
So, if the input string was: KAKAMNENENELIOLELIONEM$
The output would be:
(KA, 2)
(NE, 4)
(LIO, 2)
It also needs to be fast, it should be able to solve a 1000 character long string in a reasonable amount of time.
What I have tried:
Getting amount of branches from suffix tree:
Editing this suffix tree -creating librabry(Python-Suffix-Tree), I made a program that gives somewhat erroneus results.
I added this function to the SuffixTree class in suffix_tree.py:
def get_repeated_substrings(self):
curr_index = self.N
values = self.edges.values()
values = sorted(values, key=lambda x: x.dest_node_index)
data = [] # index = edge.dest_node_index - 1
for edge in values:
if edge.source_node_index == -1:
continue
top = min(curr_index, edge.last_char_index)
data.append([edge.source_node_index,
self.string[edge.first_char_index:top+1]])
repeated_substrings = {}
source_node_indexes = [i[0] for i in data]
nodes_pointed_to = set(source_node_indexes)
nodes_pointed_to.remove(0)
for node_pointed_to in nodes_pointed_to:
presence = source_node_indexes.count(node_pointed_to)
key = data[node_pointed_to-1][1]
if key not in repeated_substrings:
repeated_substrings[key] = 0
repeated_substrings[key] += presence
for key in repeated_substrings:
if len(key) > 1:
print(key, repeated_substrings[key])
And then used it and the rest of the library like this:
from lib.suffix_tree import SuffixTree
st = SuffixTree("KAKANENENELIOLELIONE$")
print(st.get_repeated_substrings())
Output:
KA 2
NE 7
LIO 2
IO 4
get_repeated_substrings() basically goes through all connections between nodes (called edges in this library) and saves how many more connections the node it points towards has (it saves it to repeated_substrings), and then it prints the saved values that are more than one character long.
It appends the number of connections to the amount it already had for that sequence, which works most of the time, but as you can see in the output above, it resulted in an incorrect value for 'NE' (7, it should have been 4). After solving this problem, I realised that this method won't detect patterns made out of the same character (AA, BB), and other malfunctions. My conclusion: Either there is no way to solve it with suffix trees, or I am doing something very wrong.
Other methods:
I have also tried some more straightforward ways, consisting of looping through stuff, but that hasn't been a success either:
import copy
string = 'kakabaliosie'
for main_char in set(string):
indices = []
for char_i, char in enumerate(string):
if main_char == char:
indices.append(char_i)
relative = 1
while True
for index in indices:
other_indices = copy.deepcopy(indices)
other_indices.remove(index)
for other_index in other_indices:
(can't manage to finish it)