
Let's say I have the following list of lists:

x = [[1, 2, 3, 4, 5, 6, 7],  # sequence 1
     [6, 5, 10, 11],  # sequence 2
     [9, 8, 2, 3, 4, 5],  # sequence 3
     [12, 12, 6, 5],  # sequence 4
     [5, 8, 3, 4, 2],  # sequence 5
     [1, 5],  # sequence 6
     [2, 8, 8, 3, 5, 9, 1, 4, 12, 5, 6],  # sequence 7
     [7, 1, 7, 3, 4, 1, 2],  # sequence 8
     [9, 4, 12, 12, 6, 5, 1],  # sequence 9
]

Essentially, for any list that contains the target number 5 (i.e., `target=5`) anywhere within it, what are the top `N=2` most frequently observed subsequences of length `M=4`?

So, the conditions are:

  1. if `target` doesn't exist in the list then we ignore that list completely
  2. if the list length is less than `M` then we ignore the list completely
  3. if the list is exactly length `M` but `target` is not in the `M`th position then we ignore it (but we count it if `target` is in the `M`th position)
  4. if the list length, `L`, is longer than `M` and `target` is in the `i=M` position (or the `i=M+1` position, or the `i=M+2` position, ..., or the `i=L` position) then we count the length-`M` subsequence where `target` is in the final position of the subsequence
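
These four conditions boil down to one scan per list: slide a length-`M` window and keep it only when `target` sits at the window's end. A minimal sketch (the function name is mine, not from the question):

```python
def extract_subseqs(lists, target, M):
    """Yield every length-M window that ends in `target`."""
    for lst in lists:
        if len(lst) < M:                      # condition 2 (condition 1 is implied: no target, no window)
            continue
        for i in range(M - 1, len(lst)):      # conditions 3 and 4: window must end at index >= M-1
            if lst[i] == target:
                yield tuple(lst[i - M + 1 : i + 1])
```

Running it over `x` yields exactly the six subsequences listed below.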

So, using our list-of-lists example, we'd count the following subsequences:

subseqs = [[2, 3, 4, 5],  # taken from sequence 1
           [2, 3, 4, 5],  # taken from sequence 3
           [12, 12, 6, 5],  # taken from sequence 4
           [8, 8, 3, 5],  # taken from sequence 7
           [1, 4, 12, 5],  # taken from sequence 7
           [12, 12, 6, 5],  # taken from sequence 9
]

Of course, what we want are the top N=2 subsequences by frequency. So, [2, 3, 4, 5] and [12, 12, 6, 5] are the top two most frequent sequences by count. If N=3 then all of the subsequences (subseqs) would be returned since there is a tie for third.
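
The tie rule above can be made precise with a `Counter`: rank by count, find the count of the `N`th-ranked item, and keep everything at or above that cutoff. A small sketch (the function name is mine, not from the question):

```python
from collections import Counter

def top_n_with_ties(subseqs, n):
    """Return every subsequence whose count ties it into the top n."""
    if n <= 0:
        return []
    ranked = Counter(map(tuple, subseqs)).most_common()
    if n >= len(ranked):
        return [s for s, _ in ranked]
    cutoff = ranked[n - 1][1]              # count of the nth-ranked item
    return [s for s, c in ranked if c >= cutoff]
```

With the `subseqs` above, `n=2` returns the two subsequences with count 2, while `n=3` returns all four because the count-1 subsequences tie for third.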

This is super simplified but, in reality, my actual list-of-lists

  1. consists of a few billion lists of positive integers (between 1 and 10,000)
  2. each list can be as short as 1 element or as long as 500 elements
  3. N and M can be as small as 1 or as big as 100

My questions are:

  1. Is there an efficient data structure that would allow for fast queries assuming that N and M will always be less than 100?
  2. Are there efficient algorithms or relevant areas of research for performing this kind of analysis for various combinations of N and M?
slaw
  • Looks like you actually have 2 fairly separate problems: 1. Find a way to efficiently generate a stream of subsequences and 2. Find a way to efficiently pick the top N entries in that stream. For 2. you need to iterate through all of the subsequences, and you can probably use some sort of prefix-based tree structure to keep count of the entries. For 1. I don't think you can find a way that doesn't involve iterating through every list completely, with a cache that's M entries long. I'd say best efficiency is linear in the total number of elements. – Henry Henrinson Jan 27 '20 at 16:07
  • Actually, I'm overthinking this - for 1. you can get away with a dictionary, if space is not an issue for you. – Henry Henrinson Jan 27 '20 at 16:12
  • As a separate point, this parallelises nicely, if needed. – Henry Henrinson Jan 27 '20 at 16:14
  • I think some of your numbering is wrong, should taken from sequence 5 be taken from sequence 4 and taken from sequence 8 be taken from sequence 9? – Jackson Jan 27 '20 at 16:26
  • In order to avoid needing a cache you could scan each list backwards, when you find the target your dictionary key is the next M elements - if there are M elements left. – Jackson Jan 27 '20 at 16:35
  • Actually I'm still not entirely clear what you meant by *If `N=3` then all of the subsequences (`subseqs`) would be returned since there is a tie for third.* Perhaps if you can illustrate a better example it would make more sense. – r.ook Jan 27 '20 at 17:34
  • @Jackson good catch on the sequence numbering. Corrected now. – slaw Jan 27 '20 at 17:43
  • @r.ook `N=3` means to return the top 3 most frequent subsequences. Since the most frequent subsequence (there are two of them) has a count of 2 then they are tied for first and second. Then, the remaining subsequences only have a count of 1 and so they are indistinguishable and therefore they are tied for the 3rd and 4th spot. – slaw Jan 27 '20 at 17:46
  • Okay, I guess I understand it a little better. I've updated my answer with that feedback. – r.ook Jan 27 '20 at 19:36
  • Since there is not only one N, M, and target, I assume there are chunks of lists of lists. Can you tell me the average size of a chunk (how many lists are in a chunk)? – kederrac Jan 27 '20 at 21:49
  • what is the desired output for each chunk? – kederrac Jan 27 '20 at 22:08

4 Answers


Here is an idea, based on a generalized suffix tree structure. Your list of lists can be seen as a list of strings, where the alphabet would consist of integers (so about 10k characters in the alphabet with the info you provided).
The construction of a generalized suffix tree is done in linear time w.r.t. the string length, so this should not be an issue since, in any case, you will have to go through your lists at some point.

First, store all your strings in the suffix tree. This requires 2 small adaptations of the structure.
You need to keep a counter of the number of occurrences of each suffix, since your goal is ultimately to find the most common subsequence respecting certain properties.

Then, you also want to have a lookup table from (i, d) (where i is the integer you're looking for, the target, and d is the depth in your tree, the M) to the set of nodes of your suffix tree that are labeled with the 'letter' i (your alphabet is not made of chars, but of integers) and located at depth d. This lookup table can be built by traversing your suffix tree (BFS or DFS). You can even possibly store only the node that corresponds to the highest counter value.

From there, for some query (target, M), you would first look in your lookup table, and then find the node in the tree with the highest counter value. This would correspond to the most frequently encountered 'suffix' (or subsequence) in the list of lists.

The implementation is quite complex, since the generalized suffix tree is not a trivial structure (at all), and implementing it correctly, with modifications, would not be a small feat. But I think that this would allow for a very efficient query time.
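
For intuition only, the (target, M) lookup can be mimicked with a plain dictionary of Counters (a hypothetical sketch that stores every window explicitly; a real generalized suffix tree shares suffixes and is far more space-efficient):

```python
from collections import Counter, defaultdict

def build_lookup(lists, max_m):
    """Map (target, M) -> Counter of the length-M windows ending in target."""
    table = defaultdict(Counter)
    for lst in lists:
        for m in range(1, min(max_m, len(lst)) + 1):
            for i in range(m - 1, len(lst)):
                window = tuple(lst[i - m + 1 : i + 1])
                table[(lst[i], m)][window] += 1
    return table
```

A query then reduces to `table[(target, M)].most_common(N)`.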

For a suffix tree implementation, I would recommend that you read the original papers (like this or that; sc*-h*b can be your friend) until you get a deep and real understanding of the matter, and not the online 'explanations', which are riddled with approximations and mistakes (even this post can help to get a first idea, but will misdirect you at some point if your goal is to implement a correct version).

m.raynal
  • no, I've never heard of any Python package that implements a suffix tree, I guess it would have to be 100% homemade (did one in the past, but the code is not mine to share). But even if you don't go for it, I suggest you have a look at the suffix tree structure, it's an amazing design and can come in handy in some situations. – m.raynal Jan 27 '20 at 18:40
  • I came across it while searching for a solution so your response has been reassuring. – slaw Jan 27 '20 at 18:58
  • Is there a way to build this tree to count sequences of different lengths at the same time? So, length `M=1` all the way up to `M=50`, and for any `target`? Or would we need to re-build the tree for each different `M` and each different `target`? – slaw Jan 28 '20 at 20:50
  • When you say "efficient query time" do you think this would be less than a second, less than 10 seconds, less than a minute, less than an hour? – slaw Jan 28 '20 at 22:13
  • Basically, it depends on the implementation (Python is not really fast) and on the precomputation time... but with a couple hours of precomputing, I'd say you could simply precompute all possible queries and return a result in a few dozen milliseconds. – m.raynal Jan 29 '20 at 06:20

To answer your first question: you can put all the lists into a single array, padding with zeros to a fixed length so the array becomes something you can work with. From an answer here:

x = [[1, 2, 3, 4, 5, 6, 7],  # sequence 1
     [6, 5, 10, 11],  # sequence 2
     [9, 8, 2, 3, 4, 5],  # sequence 3
     [12, 12, 6, 5],  # sequence 4
     [5, 8, 3, 4, 2],  # sequence 5
     [1, 5],  # sequence 6
     [2, 8, 8, 3, 5, 9, 1, 4, 12, 5, 6],  # sequence 7
     [7, 1, 7, 3, 4, 1, 2],  # sequence 8
     [9, 4, 12, 12, 6, 5, 1],  # sequence 9
]

import numpy as np

lens = np.fromiter(map(len, x), dtype=int)
n1, n2 = len(lens), lens.max()
arr = np.zeros((n1, n2), dtype=int)

mask = np.arange(n2) < lens[:,None]
arr[mask] = np.concatenate(x)
arr
>> [[ 1  2  3  4  5  6  7  0  0  0  0]
 [ 6  5 10 11  0  0  0  0  0  0  0]
 [ 9  8  2  3  4  5  0  0  0  0  0]
 [12 12  6  5  0  0  0  0  0  0  0]
 [ 5  8  3  4  2  0  0  0  0  0  0]
 [ 1  5  0  0  0  0  0  0  0  0  0]
 [ 2  8  8  3  5  9  1  4 12  5  6]
 [ 7  1  7  3  4  1  2  0  0  0  0]
 [ 9  4 12 12  6  5  1  0  0  0  0]]

For the second question: use np.where to find the positions matching your condition. Then you can broadcast the row and column indices by adding dimensions, to pick out the 5s together with the M-1 preceding elements:

M = 4
target = 5  # note: `N` in the question is the top-N count, so use a separate name for the target
r, c = np.where(arr[:, M-1:] == target)
arr[r[:,None], (c[:,None] + np.arange(M))]
>> array([[ 2,  3,  4,  5],
          [ 2,  3,  4,  5],
          [12, 12,  6,  5],
          [ 8,  8,  3,  5],
          [ 1,  4, 12,  5],
          [12, 12,  6,  5]])
Tarifazo

There are two parts to your question:

To generate the subsequences you want, you can use a generator to help you:

def gen_m(lst, m, val):
    '''
    lst = sub_list to parse
    m = length required
    val = target value
    '''

    found = m - 1                              # first index at which a full window can end
    for i in range(lst[m-1:].count(val)):      # repeat by the count of val from index m-1 on
        found = lst.index(val, found) + 1      # set and find the next index of val
        yield tuple(lst[found-m: found])       # yield the sliced sub_list of m length as a tuple

Then, using a generator expression, you can create a Counter of your sub-lists:

from collections import Counter
target = 5
req_len = 4

# the yielded sub_lists need to be tuples to be hashable for the Counter
counter = Counter(sub_tup for lst in x for sub_tup in gen_m(lst, req_len, target))

Then, create a generator that walks the Counter and yields the top N results, keeping ties:

req_N = 2

def gen_common(counter, n):
    s = set()
    for i, (item, count) in enumerate(counter.most_common()):
        if i < n or count in s:
            yield item
        else:
            return
        s.add(count)

result = list(gen_common(counter, req_N))

Results where N == 2:

[(2, 3, 4, 5), (12, 12, 6, 5)]

Results where N == 3:

[(2, 3, 4, 5), (12, 12, 6, 5), (8, 8, 3, 5), (1, 4, 12, 5)]

With a larger sample:

x = [[1, 2, 3, 4, 5, 6, 7],  
     [6, 5, 10, 11],  
     [9, 8, 2, 3, 4, 5],  
     [12, 12, 6, 5],  
     [5, 8, 3, 4, 2],  
     [1, 5],  
     [2, 8, 8, 3, 5, 9, 1, 4, 12, 5, 6],  
     [7, 1, 7, 3, 4, 1, 2],  
     [9, 4, 12, 12, 6, 5, 1],  
     [9, 4, 12, 12, 6, 5, 1],  
     [9, 4, 2, 3, 4, 5, 1],  
     [9, 4, 8, 8, 3, 5, 1],  
     [9, 4, 7, 8, 9, 5, 1],     
     [9, 4, 1, 2, 2, 5, 1],  
     [9, 4, 12, 12, 6, 5, 1],  
     [9, 4, 12, 12, 6, 5, 1],  
     [9, 4, 1, 4, 12, 5],  
     [9, 1, 4, 12, 5, 1]  
]

Where Counter is now:

Counter({(12, 12, 6, 5): 5, (2, 3, 4, 5): 3, (1, 4, 12, 5): 3, (8, 8, 3, 5): 2, (7, 8, 9, 5): 1, (1, 2, 2, 5): 1})

You can get results such as these:

for i in range(6):
    # testing req_N from 0 to 5
    print(list(gen_common(counter, i)))

# req_N = 0: []
# req_N = 1: [(12, 12, 6, 5)]
# req_N = 2: [(12, 12, 6, 5), (2, 3, 4, 5), (1, 4, 12, 5)]
# req_N = 3: [(12, 12, 6, 5), (2, 3, 4, 5), (1, 4, 12, 5)]
# req_N = 4: [(12, 12, 6, 5), (2, 3, 4, 5), (1, 4, 12, 5), (8, 8, 3, 5)]
# req_N = 5: [(12, 12, 6, 5), (2, 3, 4, 5), (1, 4, 12, 5), (8, 8, 3, 5), (7, 8, 9, 5), (1, 2, 2, 5)]
r.ook

Since N, M, and target are not fixed, I assume the lists come in chunks. Here is an approach with O(N + M) time complexity (where N is the number of lists in a chunk and M is the total number of elements):

from collections import Counter

def get_seq(x, M, target):
    index_for_length_m = M - 1
    for seq in (l for l in x if len(l) >= M):
        for i in (i for i, v in enumerate(seq[index_for_length_m:], start=index_for_length_m) if v == target):
            # convert to str to be hashable
            yield str(seq[i - index_for_length_m : i + 1])

def process_chunk(x, M, N, target):
    return Counter(get_seq(x, M, target)).most_common(N)

With your example:

process_chunk(x, M, 2, target)

output:

[('[2, 3, 4, 5]', 2), ('[12, 12, 6, 5]', 2)]

The performance:

%timeit process_chunk(x, M, 2, target)
# 25 µs ± 713 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
kederrac