Python: Find underlying sequence of word/strings based on list incomplete sequences

Question

Suppose following list of lists with strings (not necessarily letters):

[['b', 'd'], ['b', 'd', 'e', 'f'], ['a', 'b', 'd', 'f'], ['b', 'd', 'e'], ['d', 'e', 'f'], ['f'], ['d', 'f']]

Each item in the list represents categorical data with underlying order like letters from the alphabet. Each string has a precursor and a successor (except for the first and last one) As you can see, some of the items ['a', 'b', 'd', 'f'] are nearly complete. This particular item does not contain the letter e for example. The item ['b', 'd', 'e', 'f'] does contain the letter e (presumably in correct order) but not the letter a. Together the items do contain the information about the underlying sequence of strings but none of the items alone can provide this information. I should mention that the letters are just an exammple. Otherwise, sorting would be easy.

I would like to obtain the unique sorted items based on alignment (alignment in the sense of sequence alignment of those lists. Like so:

['a', 'b', 'd', 'e', 'f']

I am sure this is a common problem which has been solved before but I have a hard time finding similar cases. This SO thread deals with a similar issue but the ground truth is known. Here I would like to find the underlying order of strings.

Unfortunately, the longest sequence is not guaranteed to start with e.g. 'a'

I had a look at difflib but I am not sure if this is the right toolbox. Any hints are appreciated.

EDIT:

I found a solution based on NetworkX

import networkx as nx

l = [['b', 'd'], ['b', 'd', 'e', 'f'], ['a', 'b', 'd', 'f'], ['b', 'd', 'e'], ['d', 'e', 'f'], ['f'], ['d', 'f']]

# get tuples (start, stop)
new_l = []
for item in l:
    for index, subitem in enumerate(item):
        if len(item) > 1:
            if index < len(item)-1:
                new_l.append((subitem, item[index+1]))
# create a graph using those tuples
uo_graph = nx.DiGraph()
for item in new_l:
    uo_graph.add_edge(*item)


[item for item in nx.topological_sort(uo_graph)]

Out[10]: ['a', 'b', 'd', 'e', 'f']

Would be interesting if there are more pythonic solutions to this kind of problem. Especially, it would be interesting to know how to apply a check if there are multiple solutions.

If you mean [sequence alignment](https://en.wikipedia.org/wiki/Sequence_alignment) you should probably say so, because not all of us are biologists. — mapf, May 11 '20 at 13:10
Tried to make it more clear. It would be nice if you vote for reopening the question. — Moritz, May 11 '20 at 15:18
I mean, it looks like you found the solution yourself, which is great, but I have to be honest, I still don't understand the question. That might just be me though. I still have no idea how you get from your input list to the output list, or what the output list even *represents* with respect to the input list, but that is probably because I don't understand anything about sequence alignment. — mapf, May 11 '20 at 15:30
I see, I think, now I get what you mean! But isn't what you are looking for simple set logic? — mapf, May 11 '20 at 16:16
That is exactly, why I am asking the question. I thought this should be rather simple. But how do you keep track of the underlying order if you use sets? — Moritz, May 12 '20 at 09:01
Ah, so, the order isn't necessarily alphabetical? But if it's not, you *have* to know the order prior in the first place, because it would be wrong to try and determine the order from the dataset. If you had `['a', 'b', 'c']` and `['a', 'b', 'd']` as subsets and the order is *not* alphabetical (or in determined in any other way for that matter), there is no way to say if the order would be `['a', 'b', 'c', 'd']` or `['a', 'b', 'd', 'c']`. Either case would be equally likely. In the same vein, you wouldn't even be able to say if there are supposed to be other elements in between `'a'` and `'b'` — mapf, May 12 '20 at 10:09
Yes, that is why I wrote that all pieces together contain all the information to put the sequence together. If the solution is ambigious, there is a problem somewhere else. But it is a good point. I should not rely on that. — Moritz, May 12 '20 at 16:04
If all the information about the order is truely contained in the dataset, then this is more like a riddle, but solvable using logic. — mapf, May 12 '20 at 16:35

mapf · Answer 1 · 2021-07-18T14:37:27.027

Ok, I just couldn't stop thinking about this problem, and now I have finally found the solution I was looking for. This only works if your problem is in deed well-definied, meaning that a) there are no contradictions in your dataset, and b) that there is only one order that can be derived from it, with no room for ambiguity.

Based on this assumption, I noticed that your problem looks very similar to a zebra, or Einstein puzzle. I knew about this type of riddle, but it took me a very long time to figure out what it was actually called.

At first, I thought you could tackle this type of problem by solving a system of linear equations, but soon I realized that it is not that simple. So then I searched for Python libraries that can solve these kinds of problems. To my surprise, there don't seem to exist any. Eventually I came across this reddit post, where somebody asks about how to solve Einstein's riddle in Python, and somebody in the comments points out this rather recent video of a talk by Raymond Hettinger about how you can use Python to solve well-defined problems. The interesting part is when he starts to talk about SAT solvers. Here is the link to the relevant part of his slides. These kind of solvers are exactly what you need for your problem. We are very lucky that he made all this information so accessible (here is the github repo), because otherwise, I probably would not have been able to come up with a solver adjusted to your problem.

For some reason, there is one crucial part missing in the documentation though, which are the custom convenience functions he came up with. They were pretty hard to find, at some point, I was even thinking about sending him a message somehow, and ask for where to find them. Fortunately, that wasn't necessary, because eventually, I found them here.

With all the tools in place, I only needed to modify the solver according to your problem. The most difficult part was to come up with / translate the rules that determine how the order is retrieved from your dataset. There is really only one rule though, which is the consecutive rule, and kinda goes like this:

For any consecutive item pair in your dataset, in the order, the second item can only be placed at positions after the first one. There can be other items in between them, but there don't have to be.

It took quite some trial and error to get the implementation right.

Finally, here is my version of the SAT solver that solves your problem:

import numpy as np
import itertools
from pprint import pprint
from sys import intern

from sat_utils import solve_one, from_dnf, one_of


datasets = [
    ['b', 'd'], ['b', 'd', 'e', 'f'], ['a', 'b', 'd', 'f'], ['b', 'd', 'e'],
    ['d', 'e', 'f'], ['f'], ['d', 'f']
]

values = set(itertools.chain.from_iterable(datasets))
positions = np.arange(len(values))
order = np.empty(positions.size, dtype=object)


def comb(value, position):
    """Format how a value is shown at a given position"""
    return intern(f'{value} {position}')


def found_at(value, position):
    """Value known to be at a specific position"""
    return [(comb(value, position),)]


def consecutive(value1, value2):
    """
    The values are in consecutive positions:  a & b | a & _ & b |
    a & _ & _ & b ...

    """
    lst = [
        (comb(value1, i), comb(value2, j+i+1)) for i in range(len(positions))
        for j in range(len(positions[i+1:]))
    ]
    return from_dnf(lst)


cnf = []

# each value gets assigned to exactly one position
for value in values:
    cnf += one_of(comb(value, position) for position in positions)

# for each consecutive value pair, add all potential successions
for subset in datasets:
    for value1, value2 in zip(subset, subset[1:]):
        cnf += consecutive(value1, value2)


solution = solve_one(cnf)
for pair in solution:
    value, position = pair.split()
    order[int(position)] = value
pprint(order)  # array(['a', 'b', 'd', 'e', 'f'], dtype=object)

I am sure this can still be optimized here and there, and I know the code is a lot longer than your solution, but I think from a technical standpoint, using a SAT solver is a very nice approach. And if you believe what Raymond Hettinger says in his talk, this is also pretty fast and scales very well.

Please make sure to test this thoroughly though, because I cannot guaratee that I didn't make any mistakes.

As a side note: In order to determine the unique items of your sample dataset, I used the nice trick pointed out in the comments here.

score 0 · Answer 2 · answered May 11 '20 at 13:11

0

x = [['b', 'd'], ['b', 'd', 'e', 'f], ['a', 'b', 'd', 'f], ['b', 'd', 'e'], [['d', 'e', 'f'], ['f'], ['d', 'f']]
for l in x:
    l.sort()
#if you want all the elements inside in order...elements should be of the same type
q = set()
for l in x:
    q.update(l)
s = list(q)
s.sort()

answered May 11 '20 at 13:11

Narcisse Doudieu Siewe

1,074
1
7
9

unfortunatley, this was a bad example. The strings are not necessarily letters. – Moritz May 11 '20 at 13:13

Python: Find underlying sequence of word/strings based on list incomplete sequences

2 Answers2