
I have a list of tuples with each tuple carrying the following information:

(start_position,end_position,list of strings)

A sample list is given as follows:

aList = [(6, 9, ['ataH']), 
(4, 9, ['svataH']), 
(0, 9, ['vEvasvataH']), 
(2, 5, ['vasu', 'vasU']), 
(1, 3, ['Eva', 'eva']), 
(0, 1, ['vA', 'vE'])]

I need to find all sequences of tuples such that each sequence covers all positions from the minimum start_position to the maximum end_position, in this case from 0 to 9. Within a sequence, say a, adjacent tuples must satisfy the constraint a[i+1][0] - a[i][1] <= 1.

In summary, the output should be as follows:

[[(0, 1, ['vA', 'vE']), (2,5,['vasu', 'vasU']), (6, 9, ['ataH'])  ],
 [(0, 1, ['vA', 'vE']), (1, 3, ['Eva', 'eva']), (4, 9, ['svataH'])],
 [(0, 9, ['vEvasvataH'])]]
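Judging from the expected output above and the code below, the allowed gap between adjacent tuples is 0 or 1. Written as a small predicate (a sketch; `adjacent` is just an illustrative name):

```python
def adjacent(prev, nxt):
    """True if nxt may directly follow prev in a sequence."""
    # nxt must start at prev's end position or exactly one past it
    return 0 <= nxt[0] - prev[1] <= 1

print(adjacent((0, 1, ['vA', 'vE']), (1, 3, ['Eva', 'eva'])))   # True
print(adjacent((0, 1, ['vA', 'vE']), (2, 5, ['vasu', 'vasU']))) # True
print(adjacent((2, 5, ['vasu', 'vasU']), (4, 9, ['svataH'])))   # False
```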

I have used the following code to achieve this:

maxVal = max(aList,key=lambda item:item[1])[1]
diffSets = list()

for i,item in enumerate(aList):
    if maxVal == item[1]:  
        _temp = [item]        
        currStart = item[0]
        for j,stuff in enumerate(aList):
            if i != j:
                if currStart == stuff[1] or currStart == stuff[1]+1:                    
                    _temp.append(stuff)
                    currStart = stuff[0]        
        diffSets.append(_temp) #the output needs to be reversed to get the sequence in the order as shown above

Is there a more efficient and faster way to achieve the same, say using itertools?

Amrith Krishna

3 Answers


Here's how I would go about it (not to say that it's necessarily faster). First, you can sort the data by both start and end. This means that when we look at combinations later, we won't have to backtrack to other values in our result (we will know that the start of entry[i] must be less than or equal to the start of entry[i+1]).

import itertools
import operator

data = [
    (6, 9, ['ataH']),
    (4, 9, ['svataH']),
    (0, 9, ['vEvasvataH']),
    (2, 5, ['vasu', 'vasU']),
    (1, 3, ['Eva', 'eva']),
    (0, 1, ['vA', 'vE'])
]

data = sorted(data, key=operator.itemgetter(0, 1))
start = min(data, key=operator.itemgetter(0))[0]
end = max(data, key=operator.itemgetter(1))[1]

Now we have the data sorted, and know our start and end values.

To find all of the combinations of the data for any size of subset, we can use this trick: https://stackoverflow.com/a/5898031/3280538

def all_combinations(data):
    for i in range(len(data)):
        yield from itertools.combinations(data, r=i+1)

Here we use yield to avoid building expensive intermediate containers. Now we write another method that consumes these combinations and checks each one's validity:

def valid_combinations(data):
    for comb in all_combinations(data):
        if comb[0][0] != start or comb[-1][1] != end:
            continue
    
        for i in range(len(comb) - 1):
            if not 0 <= comb[i+1][0] - comb[i][1] <= 1:
                break
        else:
            yield comb

Here I'm using a neat trick with the for-loop. The else block on the loop will only execute if the for loop completes naturally and does not break, and if we don't break then we know each item is valid.
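To see the for/else behavior in isolation, here is a minimal sketch separate from the answer's code (`has_gap` is just an illustrative name):

```python
def has_gap(intervals):
    # break on the first adjacency violation; the else clause only
    # runs when the loop finishes without hitting break
    for a, b in zip(intervals, intervals[1:]):
        if not 0 <= b[0] - a[1] <= 1:
            break
    else:
        return False  # no break -> every adjacent pair was valid
    return True

print(has_gap([(0, 1), (2, 5), (6, 9)]))  # False: fully connected
print(has_gap([(0, 1), (4, 9)]))          # True: gap between 1 and 4
```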

All together:

import itertools
import operator


def all_combinations(data):
    for i in range(len(data)):
        yield from itertools.combinations(data, r=i+1)


def valid_combinations(data):
    data = sorted(data, key=operator.itemgetter(0, 1))
    start = min(data, key=operator.itemgetter(0))[0]
    end = max(data, key=operator.itemgetter(1))[1]

    for comb in all_combinations(data):
        if comb[0][0] != start or comb[-1][1] != end:
            continue

        for i in range(len(comb) - 1):
            if not 0 <= comb[i+1][0] - comb[i][1] <= 1:
                break
        else:
            yield comb

Getting the results:

from pprint import pprint
pprint(list(valid_combinations(
    [
        (6, 9, ['ataH']),
        (4, 9, ['svataH']),
        (0, 9, ['vEvasvataH']),
        (2, 5, ['vasu', 'vasU']),
        (1, 3, ['Eva', 'eva']),
        (0, 1, ['vA', 'vE'])
    ]
)))
[((0, 9, ['vEvasvataH']),),
 ((0, 1, ['vA', 'vE']), (1, 3, ['Eva', 'eva']), (4, 9, ['svataH'])),
 ((0, 1, ['vA', 'vE']), (2, 5, ['vasu', 'vasU']), (6, 9, ['ataH']))]
flakes

Let's assume that none of your nodes are allowed to overlap completely. You can set up a graph of the connections between nodes, and pass it off to something like networkx, which implements the accepted optimal search algorithms you are interested in.

You have a DAG with connections represented by overlaps between the end of one node and the start of the next. Here is how you could populate the graph:

import networkx as nx

g = nx.DiGraph()
for i, n in enumerate(aList):
    g.add_node(i, start=n[0], end=n[1], data=n[2])

The nodes are indices of the data in aList, with the data stored as attributes. Here is how you can add edges. This is O(n²), but can be reduced to O(n log n) if you pre-sort the data appropriately:

for i, m in enumerate(aList):
    for j, n in enumerate(aList):
        if i != j and m[0] < n[0] and n[0] - m[1] <= 1 and m[1] < n[1]:
            g.add_edge(i, j)

Now you can find paths that span the entire graph. First identify starting points:

from operator import itemgetter

minValue = min(aList, key=itemgetter(0))[0]
maxValue = max(aList, key=itemgetter(1))[1]

start = [i for i, n in enumerate(aList) if n[0] == minValue]
end = [i for i, n in enumerate(aList) if n[1] == maxValue]

Finding the paths can be done with nx.all_simple_paths, among others:

paths = []
for node in start:
    if node in end:
        paths.append([node])
    else:
        paths.extend(nx.all_simple_paths(g, node, end))

Now you have a list like this:

[[2], [5, 3, 0], [5, 3, 1], [5, 4, 1], [5, 4, 3, 0], [5, 4, 3, 1]]

You can use the indices in the result to sample the original list, or use the metadata stored in the graph, depending on your preferences. Here is the former approach:

result = [[aList[item] for item in path] for path in paths]

The final result looks like this:

[[(0, 9, ['vEvasvataH'])],
 [(0, 1, ['vA', 'vE']), (2, 5, ['vasu', 'vasU']), (6, 9, ['ataH'])],
 [(0, 1, ['vA', 'vE']), (2, 5, ['vasu', 'vasU']), (4, 9, ['svataH'])],  # Do you want this (5 > 4)?
 [(0, 1, ['vA', 'vE']), (1, 3, ['Eva', 'eva']), (4, 9, ['svataH'])],
 [(0, 1, ['vA', 'vE']), (1, 3, ['Eva', 'eva']), (2, 5, ['vasu', 'vasU']), (6, 9, ['ataH'])], # Do you want this (3 > 2)?
 [(0, 1, ['vA', 'vE']), (1, 3, ['Eva', 'eva']), (2, 5, ['vasu', 'vasU']), (4, 9, ['svataH'])]] # Do you want this (3 > 2), (5 > 4)?

Since the paths are ordered, it's easy to see how they span the interval immediately.
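The overgenerated rows marked above can be removed by tightening the edge test so that the next node must start at or exactly one past the previous node's end. A sketch of the corrected condition (`has_edge` is just an illustrative name):

```python
def has_edge(m, n):
    # m, n are (start, end, strings) tuples; connect m -> n only when
    # n starts at m's end position or exactly one past it, which
    # rejects overlapping pairs such as (2, 5, ...) -> (4, 9, ...)
    return 0 <= n[0] - m[1] <= 1

print(has_edge((2, 5, ['vasu']), (4, 9, ['svataH'])))  # False
print(has_edge((1, 3, ['Eva']), (4, 9, ['svataH'])))   # True
```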

Mad Physicist
  • Thanks a lot for the solution. In its current form, it overgenerates some solutions which aren't valid (the ones you have marked using comments). However, you have illustrated the solution at large and I have got the clarity to modify it according to my requirements. – Amrith Krishna Dec 09 '21 at 08:48
  • @AmrithKrishna. You could add `n[0] - m[1] >= 0` into the condition, and possibly remove the constraint `m[0] < n[0]`. Pre-sorting this and using a recursive DFS algorithm is likely to be much faster anyway. – Mad Physicist Dec 09 '21 at 14:22

Here is an alternative solution that uses a dictionary with start_position as a key to speed up the search.

import operator as op
import itertools as it
from collections import defaultdict

# These two accessor functions make it more legible to work with the tuples
start_position = op.itemgetter(0)
end_position = op.itemgetter(1)

def find_all_paths(list_of_tuples):
    last_position = end_position(max(list_of_tuples, key=end_position))

    tuples_dict = defaultdict(list)
    for t in list_of_tuples:
        tuples_dict[start_position(t)].append(t)

    def all_paths_starting_from(next_value):
        candidates = it.chain(tuples_dict[next_value], tuples_dict[next_value+1])

        paths = []
        for candidate in candidates:
            if end_position(candidate) == last_position:
                paths.append([candidate])
            else:
                for branch in all_paths_starting_from(end_position(candidate)):
                    paths.append([candidate] + branch)
        return paths
    return all_paths_starting_from(-1)

The idea is that, in a large graph, the biggest time consumer would be searching the next possible nodes for a given node. The tuples_dict reduces this time to essentially O(1). Also, the maximum depth of the recursion will be len(tuples_dict).

If the input set is large and the graph has a lot of branching and merging, the all_paths_starting_from function can be memoized for performance (but the code gets less legible). The memoized solution would be:

def find_all_paths(list_of_tuples):
    last_position = end_position(max(list_of_tuples, key=end_position))

    tuples_dict = defaultdict(list)
    for t in list_of_tuples:
        tuples_dict[start_position(t)].append(t)

    known_paths = {}  # Dict from next_value to the possible paths
    def all_paths_starting_from(next_value):
        if next_value in known_paths:
            return known_paths[next_value]
        candidates = it.chain(tuples_dict[next_value], tuples_dict[next_value+1])

        paths = []
        for candidate in candidates:
            if end_position(candidate) == last_position:
                paths.append([candidate])
            else:
                for branch in all_paths_starting_from(end_position(candidate)):
                    paths.append([candidate] + branch)
        known_paths[next_value] = paths
        return paths
    return all_paths_starting_from(-1)

Here is an example with a graph that has some more branching and merging (and dead ends):

test_input = [(0, 2, ['1']),
              (0, 3, ['2']),
              (0, 5, ['3']),
              (2, 7, ['4']),
              (3, 8, ['5']),
              (4, 9, ['6']),
              (8, 13, ['7']),
              (9, 14, ['8']),
              (10, 11, ['9']),
              (14, 15, ['10']),
              ]

from pprint import pprint
pprint(find_all_paths(test_input))

and the result is:

[[(0, 2, ['1']), (2, 7, ['4']), (8, 13, ['7']), (14, 15, ['10'])],
 [(0, 2, ['1']), (3, 8, ['5']), (8, 13, ['7']), (14, 15, ['10'])],
 [(0, 2, ['1']), (3, 8, ['5']), (9, 14, ['8']), (14, 15, ['10'])],
 [(0, 3, ['2']), (3, 8, ['5']), (8, 13, ['7']), (14, 15, ['10'])],
 [(0, 3, ['2']), (3, 8, ['5']), (9, 14, ['8']), (14, 15, ['10'])],
 [(0, 3, ['2']), (4, 9, ['6']), (9, 14, ['8']), (14, 15, ['10'])]]
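Running the non-memoized find_all_paths above on the question's original aList reproduces the three expected sequences; as a self-contained snippet:

```python
import operator as op
import itertools as it
from collections import defaultdict

# accessor functions, as in the answer above
start_position = op.itemgetter(0)
end_position = op.itemgetter(1)

def find_all_paths(list_of_tuples):
    last_position = end_position(max(list_of_tuples, key=end_position))

    # index the tuples by their start position for O(1) lookup
    tuples_dict = defaultdict(list)
    for t in list_of_tuples:
        tuples_dict[start_position(t)].append(t)

    def all_paths_starting_from(next_value):
        # a successor may start at the previous end or one past it
        candidates = it.chain(tuples_dict[next_value], tuples_dict[next_value + 1])
        paths = []
        for candidate in candidates:
            if end_position(candidate) == last_position:
                paths.append([candidate])
            else:
                for branch in all_paths_starting_from(end_position(candidate)):
                    paths.append([candidate] + branch)
        return paths

    return all_paths_starting_from(-1)

aList = [(6, 9, ['ataH']), (4, 9, ['svataH']), (0, 9, ['vEvasvataH']),
         (2, 5, ['vasu', 'vasU']), (1, 3, ['Eva', 'eva']), (0, 1, ['vA', 'vE'])]

for path in find_all_paths(aList):
    print(path)
```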
nonDucor