19

I have a list of vectors as follows.

data <- list(v1=c("a", "b", "c"), v2=c("g", "h", "k"), 
             v3=c("c", "d"), v4=c("n", "a"), v5=c("h", "i"))

I am trying to achieve the following:

  1. Check whether any of the vectors intersect with each other
  2. If intersecting vectors are found, get their union

So the desired output is

out <- list(v1=c("a", "b", "c", "d", "n"), v2=c("g", "h", "k", "i"))

I can get the union of a group of intersecting sets as follows.

 Reduce(union, list(data[[1]], data[[3]], data[[4]]))
 Reduce(union, list(data[[2]], data[[5]])

How to first identify the intersecting vectors? Is there a way of dividing the list into lists of groups of intersecting vectors?

#Update

Here is an attempt using data.table. Gets the desired results. But still slow for large lists as in this example dataset.

datasets. 
data <- sapply(data, function(x) paste(x, collapse=", "))
data <- as.data.frame(data, stringsAsFactors = F)

repeat {
  M <- nrow(data)
  data <- data.table( data , key = "data" )
  data <- data[ , list(dataelement = unique(unlist(strsplit(data , ", " )))), by = list(data)]
  data <- data.table(data , key = "dataelement" )
  data <- data[, list(data = paste0(sort(unique(unlist(strsplit(data, split=", ")))), collapse=", ")), by = "dataelement"]
  data$dataelement <- NULL
  data <- unique(data)
  N <- nrow(data)
  if (M == N)
    break
}

data <- strsplit(as.character(data$data) , "," )
Anoushiravan R
  • 21,622
  • 3
  • 18
  • 41
Crops
  • 5,024
  • 5
  • 38
  • 65

6 Answers6

24

This is kind of like a graph problem so I like to use the igraph library for this, using your sample data, you can do

library(igraph)
#build edgelist
el <- do.call("rbind",lapply(data, embed, 2))
#make a graph
gg <- graph.edgelist(el, directed=F)
#partition the graph into disjoint sets
split(V(gg)$name, clusters(gg)$membership)

# $`1`
# [1] "b" "a" "c" "d" "n"
# 
# $`2`
# [1] "h" "g" "k" "i"

And we can view the results with

V(gg)$color=c("green","purple")[clusters(gg)$membership]
plot(gg)

enter image description here

MrFlick
  • 195,160
  • 17
  • 277
  • 295
17

Here's another approach using only base R

Update

Next update after akrun's comment and with his sample data:

data <- list(v1=c('g', 'k'), v2= letters[1:4], v3= c('b', 'c', 'd', 'a'))

Modified function:

x <- lapply(seq_along(data), function(i) {
  if(!any(data[[i]] %in% unlist(data[-i]))) {
    data[[i]]
  } else if (any(data[[i]] %in% unlist(data[seq_len(i-1)]))) {
    NULL 
  } else {
    z <- lapply(data[-seq_len(i)], intersect,  data[[i]]) 
    z <- names(z[sapply(z, length) >= 1L])
    if (is.null(z)) NULL else union(data[[i]], unlist(data[z]))
  }
})
x[!sapply(x, is.null)]
#[[1]]
#[1] "g" "k"
#
#[[2]]
#[1] "a" "b" "c" "d"

This works well with the original sample data, MrFlick's sample data and akrun's sample data.

talat
  • 68,970
  • 21
  • 126
  • 157
  • This is great. But how to ensure that the vectors without any intersections are also retained in the final output. – Crops Dec 17 '14 at 09:54
  • 1
    @Crops, good point! I updated my answer with a modified function – talat Dec 17 '14 at 10:09
  • 1
    This seems to behave poorly with the sample data: `data <- list(v1=c("a", "b"), v2=c("b", "c"), v3=c("a", "d"), v4=c("g", "k"), v5=c("c", "d"))`. It also returns incomplete subsets of the data as proper groups. – MrFlick Dec 17 '14 at 18:44
  • 1
    Good catch, @MrFlick! I updated my answer once more. – talat Dec 17 '14 at 19:28
  • 1
    @docendodiscimus Just saw your update. But, it still doesn't seem to work for `data <- list(v1=c('g', 'k'), v2= letters[1:4], v3= c('b', 'c', 'd', 'a'))`. BTW, I didn't change my code, so it won't work with mine either. – akrun Dec 18 '14 at 06:52
  • 1
    @akrun, I provided another update of my answer. You might want to check yours too for those cases – talat Dec 18 '14 at 07:54
  • @docendodiscimus I will once I get some time. – akrun Dec 18 '14 at 09:20
  • @docendodiscimus I am trying this with the following [data](http://ge.tt/4NI4CL72/v/0?c?c) and getting irregular results. The maximum `length` of a vector in `data` is 83 while that in `x` after `union` is just 10. There were many vectors in `data` with `length` > 10. Moreover `unique(unlist(x))` is not identical to `unique(unlist(data))`. – Crops Dec 23 '14 at 11:02
  • 1
    @Crops, I will check that, but the data you link to is a data.frame with 1 column. Each row has _one_ string as it's formatted currently. How do you turn this into something similar to the list as in your question? Can you perhaps provide a text file with the output of `dput(data)` where `data` is already formatted correctly as list? That would ensure that we're both working with the same data – talat Dec 23 '14 at 11:14
  • @docendodiscimus I use `data <- strsplit(as.character(data[,1]) , "," )`. This will convert the column to the list of character vectors. – Crops Dec 23 '14 at 11:22
  • @docendodiscimus Here is the `dput(data)` as a [text file](http://ge.tt/4NI4CL72/v/1?c). – Crops Dec 23 '14 at 11:29
  • Ok, got it. Let me check what's going on there – talat Dec 23 '14 at 11:30
  • @Crops, I think the problem was that your input list is not a named list and I'm making use of the list names in my answer. Add this line `names(data) <- paste0("V", seq_along(data))` before you run my code and let me know if it produces the desired result. – talat Dec 23 '14 at 12:01
  • 1
    @Crops, There is still something going on that's not intended. I will work on fixing that but don't have enough time right now. Feel free to accept MrFlick's clean answer instead. – talat Dec 23 '14 at 13:19
9

Efficiency be damned and do you people even sleep? Base R only and much slower than the fastest answer. Since I wrote it, might as well post it.

f.union = function(x) {
  repeat{
    n = length(x)
    m = matrix(F, nrow = n, ncol = n)
    for (i in 1:n){
      for (j in 1:n) {
        m[i,j] = any(x[[i]] %in% x[[j]])
      }
    }
    o = apply(m, 2, function(v) Reduce(union, x[v]))
    if (all(apply(m, 1, sum)==1)) {return(o)} else {x=unique(o)}
  }
}

f.union(data)

[[1]]
[1] "a" "b" "c" "d" "n"

[[2]]
[1] "g" "h" "k" "i"

Because I like being slow. (loaded library outside of benchmark)

Unit: microseconds
    expr      min        lq      mean    median        uq       max neval
   vlo()  896.435 1070.6540 1315.8194 1129.4710 1328.6630  7859.999  1000
 akrun()  596.263  658.6590  789.9889  694.1360  804.9035  3470.158  1000
 flick()  805.854  928.8160 1160.9509 1001.8345 1172.0965  5780.824  1000
  josh() 2427.752 2693.0065 3344.8671 2943.7860 3524.1550 16505.909  1000 <- deleted :-(
   doc()  254.462  288.9875  354.6084  302.6415  338.9565  2734.795  1000
Vlo
  • 3,168
  • 13
  • 27
  • I am too tired. Going to sleep. I think there was a fifth answer that required RGBL but I may have just been imagining things. Your solution is the fastest. – Vlo Dec 17 '14 at 09:06
  • 1
    ok, RBGL is some kind of cool bioconductor package. Just ran the fifth answer that the author deleted. – Vlo Dec 17 '14 at 09:10
  • "Coffee! Because you can sleep when you're dead!" (not mine; easily found at CafePress, e.g.) – Carl Witthoft Dec 17 '14 at 12:34
8

One option would be to use combn and then find the intersects. There would be easier options.

indx <- combn(names(data),2)
lst <- lapply(split(indx, col(indx)), 
        function(i) Reduce(`intersect`,data[i]))
indx1 <- names(lst[sapply(lst, length)>0])
indx2 <- indx[,as.numeric(indx1)]
indx3 <- apply(indx2,2, sort)
lapply(split(1:ncol(indx3), indx3[1,]),
   function(i) unique(unlist(data[c(indx3[,i])], use.names=FALSE)))
#$v1
#[1] "a" "b" "c" "d" "n"

#$v2
#[1] "g" "h" "k" "i"

Update

You could use combnPrim from library(gRbase) to make this even faster. Using a slightly bigger dataset

library(gRbase)
set.seed(25)
data <- setNames(lapply(1:1e3,function(i)sample(letters,
         sample(1:20), replace=FALSE)), paste0("v", 1:1000))

and comparing with the fastest. These are modified functions based on OP's comments to @docendo discimus.

akrun2M <- function(){
     ind <- sapply(seq_along(data), function(i){#copied from @docendo discimus
            !any(data[[i]] %in% unlist(data[-i]))
              })
     data1 <- data[!ind] 
     indx <- combnPrim(names(data1),2)
     lst <- lapply(split(indx, col(indx)), 
              function(i) Reduce(`intersect`,data1[i]))
     indx1 <- names(lst[sapply(lst, length)>0])
     indx2 <- indx[,as.numeric(indx1)]
     indx3 <- apply(indx2,2, sort)
     c(data[ind],lapply(split(1:ncol(indx3), indx3[1,]),
        function(i) unique(unlist(data[c(indx3[,i])], use.names=FALSE))))
   } 

doc2 <- function(){
      x <- lapply(seq_along(data), function(i) {
          if(!any(data[[i]] %in% unlist(data[-i]))) {
               data[[i]]
           } 
          else {
            z <- unlist(data[names(unlist(lapply(data[-c(1:i)],
                                     intersect, data[[i]])))]) 
          if (is.null(z)){ 
               z
               }
          else union(data[[i]], z)
        }
   })
x[!sapply(x, is.null)]
}

Benchmarks

 microbenchmark(doc2(), akrun2M(), times=10L)
 # Unit: seconds
 #    expr      min       lq     mean   median       uq      max neval  cld
 #   doc2() 35.43687 53.76418 54.77813 54.34668 62.86665 67.76754    10   b
 #akrun2M() 26.64997 28.74721 38.02259 35.35081 47.56781 49.82158    10   a 
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 1
    I wrote this a couple days ago, just wanted to let you know that I have taken the last two lines from a solution of yours, so I would like to thank you for that. It's stable but quite slow. I was thinking about replacing all `for` loops with `lapply` or something similar. – Anoushiravan R Jun 19 '21 at 20:45
  • 1
    @AnoushiravanR I can't imagine I wrote this. Looks all foreign to me (after several years) :-) – akrun Jun 19 '21 at 20:47
  • 1
    hahaha just wondering what I was doing when you wrote this some years ago. I was in the total darkness. I was looking for an alternative solution to `igraph` to a question of mine and came up with this, but I have to learn `igraph` it's so useful. it all started from here: https://stackoverflow.com/questions/27520310/union-of-intersecting-vectors-in-a-list-in-r – Anoushiravan R Jun 19 '21 at 20:50
  • 1
    I don;t want to take your time anymore dear Arun you have been very generous today as always. I don't know how to thank you. – Anoushiravan R Jun 19 '21 at 20:51
4

I came across a similar problem that prompted me to look everywhere for a solution. I finally found a very good one thanks to a number of great contributors here, however as I seen this post I thought I would write my own custom function for this purpose. It's not actually elegant and is too slow but I think it's quite effective and can do the trick for now until I make some improvements:

anoush <- function(x) {
# First we check whether x is a list

  stopifnot(is.list(x)) 

# Then we take every element of the input and calculate the intersect between
# that element & others. In case there were some we would store the indices 
# in `vec` vector. So in the end we have a list called `ind` whose elements 
# are all the indices connected with the corresponding elements of the original 
# list for example first element of `ind` is `1`, `2`, `3` which means in 
# the original list these elements have common values.
  
  ind <- lapply(1:length(x), function(a) {
    vec <- c()
    for(i in 1:length(x)) {
      if(length(unique(base::intersect(x[[a]], x[[i]]))) > 0) {
        vec <- c(vec, i)
      }
    }
    vec 
    })

# Then we go on to again compare each element of `ind` with other elements
# in case there were any intersect, we will calculate the `union` of them.
# for each element we will end up with a list of accumulated values but
# but in the end we use `Reduce` to capture only the last one. So for each
# element of `ind` we end up having a collection of indices that also 
# result in duplicated values. For example elements `1` through `5` of 
# `dup_ind` contains the same value cause in the original list these 
# elements have common values.

  dup_ind <- lapply(1:length(ind), function(a) {
    out <- c()
    for(i in 1:length(ind)) {
      if(length(unique(base::intersect(ind[[a]], ind[[i]]))) > 0) {
        out[[i]] <- union(ind[[a]], ind[[i]])
      }
      vec2 <- Reduce("union", out)
    }
    vec2
  }) 

# Here we get rid of the duplicated elements of the list by means of 
# `relist` funciton and since in this process all the duplicated elements
# will turn to `integer(0)` I have filtered those out.
  
  un <- unlist(dup_ind)
  res <- Map(`[`, dup_ind, relist(!duplicated(un), skeleton = dup_ind))
  res2 <- Filter(length, res)
  
  sapply(res2, function(a) unique(unlist(lapply(a, function(b) `[[`(x, b)))))
  
}

OP's Data Sample

> anoush(data)

[[1]]
[1] "a" "b" "c" "d" "n"

[[2]]
[1] "g" "h" "k" "i"

Dear @akrun's Data Sample

data <- list(v1=c('g', 'k'), v2= letters[1:4], v3= c('b', 'c', 'd', 'a'))

> anoush(data)
[[1]]
[1] "g" "k"

[[2]]
[1] "a" "b" "c" "d"
Anoushiravan R
  • 21,622
  • 3
  • 18
  • 41
0

In general, you cannot do much better/faster than Floyd-Warshall-Algorithm, which is as follows:

library(Rcpp)

cppFunction(
  "LogicalMatrix floyd(LogicalMatrix w){
    int n = w.nrow();
    for( int k = 0; k < n; k++ )
     for( int i = 0; i < (n-1); i++ )
      for( int j = i+1; j < n; j++ ) 
       if( w(i,k) && w(k,j) ) {
        w(i,j) = true;
        w(j,i) = true;
       }
   return w;
}")

fw.union<-function(x) {
  n<-length(x)
  w<-matrix(F,nrow=n,ncol=n)
  for( i in 1:n ) {
   w[i,i]<-T
  }
  for( i in 1:(n-1) ) {
   for( j in (i+1):n ) {
     w[i,j]<-w[j,i]<- any(x[[i]] %in% x[[j]])
   }
  }
 apply( unique( floyd(w) ), 1, function(y) { Reduce(union,x[y]) } )
}

Running benchmarks would be interesting, though. Preliminary tests suggest that my implementation is about 2-3 times faster than Vlo's.

Funkwecker
  • 766
  • 13
  • 22