Top of the most frequent substrings in file of sequences

Question

I have a file with many sequences (all of them have different length) like these:

5-2-6
5-6-6-6-1-8
4
1-7-6-6-2-6-6-1
18-24-2-6-6-1-8

I need to calculate the most frequent substrings in all sequences and get result like this:

6-6       5 times
2-6       3 times
6-6-1     3 times
6-1       3 times
6-1-8     2 times
1-8       2 times
2-6-6-1   2 times
2-6-6     2 times

Substrings can be any length starting from 2 numbers. I should provide the opportunity in code to change length of the substring, for example, to find substrings longer than 3 numbers.

I know I should use suffix tree, but I can't find working example for my case((

Language: R or Python.

Welcome to Stack Overflow! It looks like you want us to write some code for you. While many users are willing to produce code for a coder in distress, they usually only help when the poster has already tried to solve the problem on their own. A good way to demonstrate this effort is to include the code you've written so far, example input (if there is any), the expected output, and the output you actually get (console output, stack traces, compiler errors - whatever is applicable). The more detail you provide, the more answers you are likely to receive. — Martijn Pieters, Mar 11 '14 at 18:14

score 1 · Accepted Answer · edited May 23 '17 at 12:10

You don't need a suffix tree; just split sequences and use a sliding window iterator to produce substrings; a loop starting at a minimum size looping up to the number of numbers on the line will let you produce different sizes of substrings.

A collections.Counter() object can then keep track of counts:

from collections import Counter
from itertools import islice

def window(seq, n=2):
    "Returns a sliding window (of width n) over data from the iterable"
    "   s -> (s0,s1,...s[n-1]), (s1,s2,...,sn), ...                   "
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result    
    for elem in it:
        result = result[1:] + (elem,)
        yield result

minimum_size = 2
counts = Counter()

with open(filename) as infh:
    for line in infh:
        parts = line.strip().split('-')
        # count substrings of various sizes
        counts.update('-'.join(sp) 
                      for size in range(minimum_size, len(parts) + 1)
                      for sp in window(parts, size))

The Counter.most_common() method then gives you a sorted list substrings sorted by frequency:

for substring, count in counts.most_common():
    print '{:<20} {:>3d} times'.format(substring, count)

For your sample data, the top 10 entries are:

6-6                    5 times
6-6-1                  3 times
2-6                    3 times
6-1                    3 times
2-6-6                  2 times
1-8                    2 times
6-1-8                  2 times
6-6-1-8                2 times
2-6-6-1                2 times
18-24-2-6-6            1 times

If you want to focus on different minimum substring lengths, you don't have to re-count the whole file. You could instead filter the Counter object; count the - characters in the keys:

for substr, freq in counts.most_common():
    if substr.count('-') < 2:
        # skip substrings shorter than 3 elements
        continue

Martijn, thank you! You save me a lot of time and show new features in Python. I really appreciate it! — user3406835, Mar 12 '14 at 08:05

score 1 · Answer 2 · answered Mar 11 '14 at 18:38

R isn't the best language for this, because trees are not (to my knowledge) implemented well. If you're looking for a quick-and-dirty solution, this will work. It is not efficient in any way:

# Create some data
set.seed(1)
data<-replicate(1000,sample(1:10,round(runif(1,1,8)),replace=TRUE))
# A function for all substrings
all.substrings<-function(s) unlist(sapply(1:length(s), function(y) sapply(y:length(s), function(x) s[y:x])),recursive=FALSE)
# Grab all strings
substrings<-unlist(sapply(data,all.substrings),recursive=FALSE)
# Strip one length substrings
substrings<-substrings[lapply(substrings,length)>1]
# Paste them together so you can easily aggregate
substrings.pasted<-sapply(substrings,paste,collapse=' ')
# Aggregate and sort.
sorted.results<-sort(table(substrings.pasted),decreasing=TRUE)
# Display
head(sorted.results)
# 10 3 4 10  8 1  9 4  9 2 8 10 
#   52   50   48   48   47   45

So the sequence 10 3 was present 52 times, the maximum.

Top of the most frequent substrings in file of sequences

2 Answers2