0

I am trying to find a continuous (with strictly increasing values) path through a list of lists. I have tried various recursive and reversed approaches, but have failed for hours.

The problem stems from interval-based pattern mining. Here, each event has exact one start and end time. The problem looks like this:

time = [[11, 38, 40], [12, 39, 49], [41], [4, 23, 43], [47], [17, 35, 60]]
Events = [[start time events A], [end time events A], [start time events B], [start time events C], [end time events B], [end time events C]]

The goal is to find all possible routes through data where

time[i] < time[i+1] holds.

A concrete example:

route_1 = [[11], [12], [41], [43], [47], [60]]
route_2 = [[38], [39], [41], [43], [47], [60]]

What is not valid: route_3 = [[11], [39], [41], [43], [47], [60]] because [11] describes the start time of Event A[0], but [39] represents the end time of Event A[1].

Can you name a suitable approach to solving this problem?

For my last (non-recursive) approach, I used a dictionary as a data representation. The approach produced the "closest" result to the expected result

time_reversed = {'end_C': [17, 35, 60], 'end_B': [47], 'start_C': [ 4, 23, 43], 'start_B': [41], 'end_A': [12, 39, 49], 'start_A': [11, 38, 40]}

import numpy as np    
def consecutive(time):
    time_pruned = time.copy()
    keys = list(time.keys())

    for i in range(len(keys)-1):
        lower = time_pruned[keys[i+1]]
        max_lower = np.nanmax(lower)
        max_upper = np.nanmax(time_pruned[keys[i]])

        while max_lower > max_upper:
            lower = np.delete(lower, np.nanargmax(lower))
    
            max_lower = np.nanmax(lower)
            time_pruned[keys[i+1]] = lower

    return time_pruned

However, this approach is not valid, since it does not consider the event-affiliation and do not now at the moment, how to consider everything in an efficient way.

The function above yields:

consecutive(time_reversed)
>>>{'end_C': [17, 35, 60],
 'end_B': [47],
 'start_C': [4, 23, 43],
 'start_B': [41],
 'end_A': array([12, 39]),
 'start_A': array([11, 38])}

Update 1: I've tried to describe the approach more detailed. I also tried to bring it into code but failed while deleting elements of the list while iterating over it.

Start position for an element wise comparison is

i = 0
j = 0
l = data_pruned[keys[i]][j] = 11
r = data_pruned[keys[i+1]][j] = 12

Whereby check() is defined as l < r

Iteration 1:

11 -> 12 -> 41 -> 4 (check() = False) remove 4 from start_C and 17 from end_C since each event consists of exact one start and end time

data_pruned = { 'start_A': [11, 38, 40],
                'end_A': [12, 39, 49],
                'start_B': [41],
                'start_C': [23, 43],
                'end_B': [47],
                'end_C': [35, 60]}

Iteration 2:

11 -> 12 -> 41 -> 23 (check() = False) remove 23 from start_C and 35 from end_C since each event consists of exact one start and end time

data_pruned = { 'start_A': [11, 38, 40],
                'end_A': [12, 39, 49],
                'start_B': [41],
                'start_C': [43],
                'end_B': [47],
                'end_C': [60]}

Iteration 3:

11 -> 12 -> 41 -> 43 -> 47 -> 60 --> Possible route since each value is strictly increasing

-> Do we have other possible strictly increasing combinations/routes/paths in data_pruned?

Iteration 4:

38 -> 39 -> 41 -> 47 -> 60 --> Possible route since each value is strictly increasing

-> Do we have other possible strictly increasing combinations/routes/paths in data_pruned?

Iteration 5:

38 -> 39 -> 41 -> 47 -> 60 --> Possible route since each value is strictly increasing

40 -> 49 -> 41 (check() = False) remove 49 from end_a and 40 from start_A since each event consists of exact one start and end time

data_pruned = { 'start_A': [11, 38],
                'end_A': [12, 39],
                'start_B': [41],
                'start_C': [43],
                'end_B': [47],
                'end_C': [60]}

Do we have other possible strictly increasing combinations/routes/paths in data_pruned?

Done since all possibilities checked

Update 2: I export the data from database into different data representations such as pandas.

import pandas as pd
from io import StringIO

data = """\
event,occur,start,end
A,1,11,12
A,2,38,39
A,3,40,49
B,1,41,47
C,1,4,17
C,2,23,35
C,3,43,60
"""

# Read the data into a DataFrame
df = pd.read_csv(StringIO(data))
lnxdx
  • 3
  • 2
  • Can you add the code that you tried? – shaik moeed Aug 25 '23 at 16:02
  • Added a snippet. – lnxdx Aug 25 '23 at 16:27
  • In the example the difference between start and end is always the same for A (similarly for B). Is this a given, or is this just a coincidence in the example? – trincot Aug 25 '23 at 17:12
  • This is indeed just a coincidence in the given example. – lnxdx Aug 25 '23 at 17:54
  • In your program's output you don't seem to produce routes. It looks more like a reduced set. But this cannot work: for some routes, some times may be valid, but for others not. So you cannot simply reduce the input. Secondly, efficiency will be a relative concept here. If you need all possible routes, this might be an exponential amount. – trincot Aug 25 '23 at 18:10
  • Are all individual time values unique? If not, how will you know from the output where a value was taken from? – trincot Aug 25 '23 at 18:46
  • Is this the expected result?: `[4,11,12,17,41,47]` `[11,12,23,35,41,47]` `[11,12,41,43,47,60]` `[4,17,38,39,41,47]` `[23,35,38,39,41,47]` `[38,39,41,43,47,60]` `[4,17,40,41,47,49]` `[23,35,40,41,47,49]` `[40,41,43,47,49,60]` – trincot Aug 25 '23 at 18:50
  • Can you please comment on my requests for clarification? – trincot Aug 26 '23 at 13:05

1 Answers1

0

If I understand correctly, you have groups (A, B, C in the example), and each group has intervals (start/end). You want to use exactly one interval from each group, and have a sorted list of the involved times.

A constraint is that the order of times should follow a given pattern. A pattern could for example be A-start, A-end, B-start, C-start, B-end, C-end. Outputs that do not follow this sequence should be excluded.

Proposed algorithm

As preprocessing step, you could organise the input per group, and for each determine the two indices of the route where they will write the start and end index of an interval. These indices are determined by the pattern input. For instance, for the example in the question, we would have:

A: [0, 1]  # spans are written to these indices of the output route
B: [2, 4]
C: [3, 5]

Then determine for each output index, which other two indices would then already be filled in and would determine the "window" for the value that can be written to the targeted index. For the above example, the route would be populated in this order (determined by the pattern):

For A, we fill in the start at index 0
For A, we fill in the end   at index 1
For B, we fill in the start at index 2
For B, we fill in the end   at index 4  # note we skip index 3!
For C, we fill in the start at index 3
For C, we fill in the end   at index 5

Now when we are at the step where we fill in a end time for B, we should make sure it is not less than the value stored at index 2. This is always the same index, so we want to calculate this index up front (in its relation to index 4).

Similarly, when we are at the next step, where we fill in a start time for C at index 5, we should make sure it is not more than the value already stored at index 4. So again we want to prepare for this and set up this limit relationship between index 4 and 5.

We can use a stack-based algorithm for this (linear time complexity) and so determine for each index where we have to look to find the other two indices that will already have values and which set the "window" for the current value.

Once that preprocessing is done, we can start the actual DFS search for solutions.

There are more little aspects in this algorithm, but they should be clear from the comments in the code below:

Implementation

def routes(data, pattern):
    # setlimits: Helper function to determine which indices in the route
    #    determine the range that the value at index i of a route can have
    #    given the order in which the route is populated (pattern).
    def setlimits(pattern):
        stack = [-1]
        limit = []
        for i, key in enumerate(pattern):
            while len(stack) > 1 and pattern[stack[-1]] >= key:
                stack.pop()
            limit.append(stack[-1] + 1)
            stack.append(i)
        return limit
            
    n = len(pattern) // 2
    # Get sorted list of event keys, relying on sorted dict (Python 3.6+)
    keymapper = { key : 0 for key in pattern }
    # Map each key (string) to a unique sequence number (0..n-1)
    keymapper = { key : i for i, key in enumerate(keymapper) }
    # Convert the pattern to use these sequence numbers instead of key strings
    pattern = [keymapper[key] for key in pattern]
    # Construct data structure that gives for each group the two indices 
    #    in a route where the start/end of a span have to be written
    schedule = [[-1, 0, []] for _ in range(n)]
    # Populate the list of start/end spans for each group...
    #     NB: the data is assumed to be sorted by seq column per group
    for key, seq, start, end in data:  
        schedule[keymapper[key]][2].append((start, end))
    # Populate the two indices for where these spans should be output in a route
    for i, key in enumerate(pattern):
        schedule[key][schedule[key][0] > -1] = i
    
    inf = float("inf")
    # route is a buffer to populate during DFS algorithm
    route = [-inf] * (n*2+2)  # an extra entry before and after the route
    route[-1] = inf

    # Prepare dependencies that imply constraints
    low = setlimits(pattern)
    high = [len(route) - 1 - i for i in setlimits(pattern[::-1])[::-1]]

    # Recursive algorithm to assign pairs of start/end times to routes
    def dfs(event):
        if event >= n: # A route has been completed: 
            yield route[1:-1]  # exclude the dummy entries from the route
            return
        i, j, spans = schedule[event]
        # Determine the "windows" for values at i, and at j in the route
        startlow = route[low[i]]
        starthigh = route[high[i]]
        endlow = route[low[j]]
        endhigh = route[high[j]]
        # Try each of the spans...
        for start, end in spans:
            if startlow <= start <= starthigh and endlow <= end <= endhigh:
                # This span is acceptable. Write it to the route and recur...
                route[i+1] = start
                route[j+1] = end
                yield from dfs(event + 1)
    
    return list(dfs(0))

Here is how you would run it for the example in the question:

data = [
    ["A", 1, 11, 12],
    ["A", 2, 38, 39],
    ["A", 3, 40, 49],
    ["B", 1, 41, 47],
    ["C", 1, 4, 17],
    ["C", 2, 23, 35],
    ["C", 3, 43, 60],
]
pattern = ["A","A","B","C","B","C"]

for route in routes(data, pattern):
    print(route)

This outputs:

[11, 12, 41, 43, 47, 60]
[38, 39, 41, 43, 47, 60]
trincot
  • 317,000
  • 35
  • 244
  • 286
  • Thank you very much for the answer. You are correct, all the points in your answer are increasing, but as described in update 1 of my question, a "path" in my problem also takes into account that an event consists of exactly one start and end time, which must not mix. – lnxdx Aug 26 '23 at 12:31
  • With "not mix", do you mean that the order of the input must be preserved? Can you then please do me a favour and provide the **exact** input structure you initially get to work with? Because apparently I already started wrong by defining `time` as I did. However, in my answer I do only produce one start and one end time per group, so I'm not sure what you refer to. – trincot Aug 26 '23 at 13:02
  • Can you give me a concrete example in the output I listed, which route is not expected, and why (in detail)? – trincot Aug 26 '23 at 13:04
  • We must maintain the sequence of events that occur. Start A -> End A -> Start B -> Start C -> End B -> End C must be constant. Thus: [4, .., 47] -> 4 is part of C, which cannot before A [11, .... 47] -> Same holds for 23; this is the second start time from C [11, ..., 60] -> Valid [4, ..., 47] -> 4 is part of C which cannot before A [23,..., 47] -> 23 4 is part of C which cannot before A [38, ..., 60] -> Valid [4, ..., 49] -> 4 is part of C which cannot before A [23, ..., 49] -> Initial order must be preserved [40, ..., 60] -> Initial order must be preserved – lnxdx Aug 26 '23 at 13:38
  • OK, I think I understand. There is order in the input. Do realise that dicts didn't maintain order in Python prior to 3.6. Anyway I will rewrite my answer with this in mind. – trincot Aug 26 '23 at 13:41
  • Can you still provide in your question the exact input structure? The dict is not a very nice structure here, because you would have to compare keys and assume some "structure" in those key names. – trincot Aug 26 '23 at 13:51
  • I've added a pandas export. – lnxdx Aug 26 '23 at 14:30
  • OK, and where do you get the required order from (A A B C B C)? Do you also have a text input format for that? – trincot Aug 26 '23 at 14:36
  • This pattern comes from a pattern mining approach. With the question posed, I check whether the pattern provided is a subset of a large pattern in the database. To do this, I need to check the continuity condition – lnxdx Aug 26 '23 at 14:49
  • What is the format of this pattern? Is it a string, a list, a dictionary, ...? – trincot Aug 26 '23 at 14:55
  • OK, in absence of that info, I went ahead and rewrote my answer. Hope it is OK. – trincot Aug 27 '23 at 08:45
  • 1
    The provided solution works for the described use case. Thanks @trincot! – lnxdx Aug 30 '23 at 16:36