Suppose following list of lists with strings (not necessarily letters):
[['b', 'd'], ['b', 'd', 'e', 'f'], ['a', 'b', 'd', 'f'], ['b', 'd', 'e'], ['d', 'e', 'f'], ['f'], ['d', 'f']]
Each item in the list represents categorical data with underlying order like letters from the alphabet. Each string has a precursor and a successor (except for the first and last one) As you can see, some of the items ['a', 'b', 'd', 'f']
are nearly complete. This particular item does not contain the letter e
for example. The item ['b', 'd', 'e', 'f']
does contain the letter e
(presumably in correct order) but not the letter a
. Together the items do contain the information about the underlying sequence of strings but none of the items alone can provide this information. I should mention that the letters are just an exammple. Otherwise, sorting would be easy.
I would like to obtain the unique sorted items based on alignment (alignment in the sense of sequence alignment of those lists. Like so:
['a', 'b', 'd', 'e', 'f']
I am sure this is a common problem which has been solved before but I have a hard time finding similar cases. This SO thread deals with a similar issue but the ground truth is known. Here I would like to find the underlying order of strings.
Unfortunately, the longest sequence is not guaranteed to start with e.g. 'a'
I had a look at difflib but I am not sure if this is the right toolbox. Any hints are appreciated.
EDIT:
I found a solution based on NetworkX
import networkx as nx
l = [['b', 'd'], ['b', 'd', 'e', 'f'], ['a', 'b', 'd', 'f'], ['b', 'd', 'e'], ['d', 'e', 'f'], ['f'], ['d', 'f']]
# get tuples (start, stop)
new_l = []
for item in l:
for index, subitem in enumerate(item):
if len(item) > 1:
if index < len(item)-1:
new_l.append((subitem, item[index+1]))
# create a graph using those tuples
uo_graph = nx.DiGraph()
for item in new_l:
uo_graph.add_edge(*item)
[item for item in nx.topological_sort(uo_graph)]
Out[10]: ['a', 'b', 'd', 'e', 'f']
Would be interesting if there are more pythonic solutions to this kind of problem. Especially, it would be interesting to know how to apply a check if there are multiple solutions.