How to consolidate ordered lists into one, keeping the order (using R)

Question

I have a number of ordered lists (or sequences, or vectors, or data table columns) 1, 2, 3, with several items, for example

1 2 3
A A B
G G A
F F G
C E
D C
  D

How can I efficiently derive the "master" list which contains all elements in the correct order B, A, G, F, E, C, D? I don't even know what keywords to search for. Any hints are much appreciated.

Can you post a example of the output you would like to have? — Gainz, Sep 17 '19 at 14:28
Interesting question. If you had to do it with pen and paper, what would you do? — Bill O'Brien, Sep 17 '19 at 14:52
I'd compare 1 and 2 element-wise until they no longer match (4th row). Then I'd shift the remainder of 1 down, filling the gaps with NA, until the elements match again. — mcenno, Sep 17 '19 at 15:34
How is the ordering defined? What I mean is what if you were not able to determine how 2 elements compare from the ordered list? For example, if one of the vectors had `J < B` and `E < B`, but there is nothing to determine hoe `E` and `J` relate. If you know how to compare elements regardless of the ordered list given, this problem could be solved easily. — Joseph Wood, Sep 17 '19 at 22:48
@JosephWood - through the transitive property of inequalities. The example in the question appears to be internally consistent. — Bill O'Brien, Sep 17 '19 at 23:12
@BillO'Brien, indeed it is consistent. However, the general case can't be guaranteed. I can solve this specific question quite simply, however in writing a general solution, I came across many "problem" situations. For example, how would you solve `list(c("W", "J", "D", "S"), c("U" "J", "B", "S"))`? Is `U > W` or `B > D`, etc.? — Joseph Wood, Sep 17 '19 at 23:55
This appears to be a case of [multiple sequence alignment](https://en.wikipedia.org/wiki/Multiple_sequence_alignment), a pretty hard problem. — AkselA, Sep 18 '19 at 07:51

Maurits Evers · Accepted Answer · 2019-09-18T09:25:09.570

How about a graph-based approach.

Idea

The idea is to translate the sequences into paths in a directed graph (so A G F C D becomes a path A->G->F->C->D). By simplifying the graph we can then identify the longest connected sequence in that graph, which should then correspond to your "master" sequence.

Implementation

Note that I assume your sample data lst to be a list of vectors (see sample data at the end of this answer).

Let's construct an igraph from the different paths; each path is given by the entries in the lst vectors.

library(igraph)
ig <- make_empty_graph(
    n = length(unique(unlist(lst))),
    directed = TRUE) %>%
    set_vertex_attr("name", value = sort(unique(unlist(lst))))

for (i in 1:length(lst)) ig <- ig + path(lst[[i]])

Next we simplify the graph
```
ig <- simplify(ig)
```
It's instructive to plot the graph
```
plot(ig)
```

We now extract all simple paths; the longest simple path corresponds to the "master" list.

pths <- sapply(V(ig), function(x) {
    p <- all_simple_paths(ig, x)
    names(unlist(p[which.max(lengths(p))]))
})

pths[which.max(lengths(pths))]
$B
#[1] "B" "A" "G" "F" "E" "C" "D"

The sequence matches your expected output for the master list.

Sample data

v1 <- c("A","G","F","C","D","D")
v2 <- c("A","G","F","E","C")
v3 <- c("B", "A","G")

lst <- list(v1, v2, v3)

I don't understand this solution (have never worked with graphs before), but it looks elegant. I'll see if I can implement it. Many thanks! — mcenno, Sep 18 '19 at 06:59
@mcenno I've added a few more details to elaborate on the key idea. — Maurits Evers, Sep 18 '19 at 09:25

score 0 · Answer 2 · answered Sep 18 '19 at 05:03

Interesting problem. I think I have a working solution.

My thinking was that we can encode the vectors into a matrix to track which letters must come before and after each other letter by logic. Then we should be able to sort that matrix to find a working order.

Here, I take the three vectors and encode their implied ordering using nested loops.

v1 <- c("A","G","F","C","D","D")
v2 <- c("A","G","F","E","C")
v3 <- c("B", "A","G")

vecs <- list(v1, v2, v3)
unique_ltrs <- unique(unlist(vecs))
ltr_len <- length(unique_ltrs)
m <- matrix(0, nrow = ltr_len, ncol = ltr_len, 
       dimnames = list(unique_ltrs, unique_ltrs))

# Loops to populate m with what we know
for (v in 1:length(vecs)) {
  vec <- unique(unlist(vecs[v]))
  for (l in 1:length(vec)) {
    for (l2 in 1:length(vec)) {
      m_pos <- c(match(vec[l], unique_ltrs),
                 match(vec[l2], unique_ltrs))
      compare <- ifelse(l < l2, -1, ifelse(l2 < l, 1, 0))
      m[m_pos[1], m_pos[2]] <- compare
    }
  }
}

Here, 1 indicates the column letter comes before the row letter, while -1 means the row comes first.

> m
   A  G  F  C  D  E B
A  0 -1 -1 -1 -1 -1 1
G  1  0 -1 -1 -1 -1 1
F  1  1  0 -1 -1 -1 0
C  1  1  1  0 -1  1 0
D  1  1  1  1  0  0 0
E  1  1  1 -1  0  0 0
B -1 -1  0  0  0  0 0

Then we sort the matrix (relying on the code here), and a working order appears in the rownames:

m_ord <- m[do.call(order, as.data.frame(m)),]
#> m_ord
#   A  G  F  C  D  E B
#B -1 -1  0  0  0  0 0
#A  0 -1 -1 -1 -1 -1 1
#G  1  0 -1 -1 -1 -1 1
#F  1  1  0 -1 -1 -1 0
#E  1  1  1 -1  0  0 0
#C  1  1  1  0 -1  1 0
#D  1  1  1  1  0  0 0
rownames(m_ord)
#[1] "B" "A" "G" "F" "E" "C" "D"

I like this solution! Many thanks for putting thought into a random stranger's problem. — mcenno, Sep 18 '19 at 06:58
Not sure why the downvote. Interesting answer for an interesting question. (+1) — Maurits Evers, Sep 19 '19 at 08:08

How to consolidate ordered lists into one, keeping the order (using R)

2 Answers2

Idea

Implementation

Sample data