You don't need a suffix tree; just split sequences and use a sliding window iterator to produce substrings; a loop starting at a minimum size looping up to the number of numbers on the line will let you produce different sizes of substrings.
A collections.Counter()
object can then keep track of counts:
from collections import Counter
from itertools import islice
def window(seq, n=2):
"Returns a sliding window (of width n) over data from the iterable"
" s -> (s0,s1,...s[n-1]), (s1,s2,...,sn), ... "
it = iter(seq)
result = tuple(islice(it, n))
if len(result) == n:
yield result
for elem in it:
result = result[1:] + (elem,)
yield result
minimum_size = 2
counts = Counter()
with open(filename) as infh:
for line in infh:
parts = line.strip().split('-')
# count substrings of various sizes
counts.update('-'.join(sp)
for size in range(minimum_size, len(parts) + 1)
for sp in window(parts, size))
The Counter.most_common()
method then gives you a sorted list substrings sorted by frequency:
for substring, count in counts.most_common():
print '{:<20} {:>3d} times'.format(substring, count)
For your sample data, the top 10 entries are:
6-6 5 times
6-6-1 3 times
2-6 3 times
6-1 3 times
2-6-6 2 times
1-8 2 times
6-1-8 2 times
6-6-1-8 2 times
2-6-6-1 2 times
18-24-2-6-6 1 times
If you want to focus on different minimum substring lengths, you don't have to re-count the whole file. You could instead filter the Counter
object; count the -
characters in the keys:
for substr, freq in counts.most_common():
if substr.count('-') < 2:
# skip substrings shorter than 3 elements
continue