Let's say I have the following list of lists:
x = [[1, 2, 3, 4, 5, 6, 7], # sequence 1
[6, 5, 10, 11], # sequence 2
[9, 8, 2, 3, 4, 5], # sequence 3
[12, 12, 6, 5], # sequence 4
[5, 8, 3, 4, 2], # sequence 5
[1, 5], # sequence 6
[2, 8, 8, 3, 5, 9, 1, 4, 12, 5, 6], # sequence 7
[7, 1, 7, 3, 4, 1, 2], # sequence 8
[9, 4, 12, 12, 6, 5, 1], # sequence 9
]
Essentially, for any list that contains the target number 5
(i.e., target=5
) anywhere within the list, what are the top N=2
most frequently observed subsequences with length M=4
?
So, the conditions are:
- if
target
doesn't exist in the list then we ignore that list completely - if the list length is less than
M
then we ignore the list completely - if the list is exactly length
M
buttarget
is not in theMth
position then we ignore it (but we count it iftarget
is in theMth
position) - if the list length,
L
, is longer thanM
andtarget
is in thei=M
position(or
i=M+1position, or
i=M+2position, ...,
i=Lposition) then we count the subsequence of length
Mwhere
target` is in the final position in the subsequence
So, using our list-of-lists example, we'd count the following subsequences:
subseqs = [[2, 3, 4, 5], # taken from sequence 1
[2, 3, 4, 5], # taken from sequence 3
[12, 12, 6, 5], # taken from sequence 4
[8, 8, 3, 5], # taken from sequence 7
[1, 4, 12, 5], # taken from sequence 7
[12, 12, 6, 5], # taken from sequence 9
]
Of course, what we want are the top N=2
subsequences by frequency. So, [2, 3, 4, 5]
and [12, 12, 6, 5]
are the top two most frequent sequences by count. If N=3
then all of the subsequences (subseqs
) would be returned since there is a tie for third.
This is super simplified but, in reality, my actual list-of-lists
- consists of a few billion lists of positive integers (between 1 and 10,000)
- each list can be as short as 1 element or as long as 500 elements
N
andM
can be as small as 1 or as big as 100
My questions are:
- Is there an efficient data structure that would allow for fast queries assuming that
N
andM
will always be less than 100? - Are there efficient algorithms or relevant area of research for performing this kind of analysis for various combinations of
N
andM
?