4

Given a data.frame:

df <- data.frame(grp1 = c(1,1,1,2,2,2,3,3,3,4,4,4),
                 grp2 = c(1,2,3,3,4,5,6,7,8,6,9,10))

#> df
#   grp1 grp2
#1     1    1
#2     1    2
#3     1    3
#4     2    3
#5     2    4
#6     2    5
#7     3    6
#8     3    7
#9     3    8
#10    4    6
#11    4    9
#12    4   10

Both coluns are grouping variables, such that all 1's in column grp1 are known to be grouped together, and so on with all 2's, etc. Then the same goes for grp2. All 1's are known to be the same, all 2's the same.

Thus, if we look at the 3rd and 4th row, based on column 1 we know that the first 3 rows can be grouped together and the second 3 rows can be grouped together. Then since rows 3 and 4 share the same grp2 value, we know that all 6 rows, in fact, can be grouped together.

Based off the same logic we can see that the last six rows can also be grouped together (since rows 7 and 10 share the same grp2).

Aside from writing a fairly involved set of for() loops, is there a more straight forward approach to this? I haven't been able to think one one yet.

The final output that I'm hoping to obtain would look something like:

# > df
#    grp1 grp2 combinedGrp
# 1     1    1           1
# 2     1    2           1
# 3     1    3           1
# 4     2    3           1
# 5     2    4           1
# 6     2    5           1
# 7     3    6           2
# 8     3    7           2
# 9     3    8           2
# 10    4    6           2
# 11    4    9           2
# 12    4   10           2

Thank you for any direction on this topic!

jaimedash
  • 2,683
  • 17
  • 30
Andrew Taylor
  • 3,438
  • 1
  • 26
  • 47

4 Answers4

5

I would define a graph and label nodes according to connected components:

gmap = unique(stack(df))
gmap$node = seq_len(nrow(gmap))

oldcols = unique(gmap$ind)
newcols = paste0("node_", oldcols)
df[ newcols ] = lapply(oldcols, function(i)  with(gmap[gmap$ind == i, ], 
  node[ match(df[[i]], values) ]
))

library(igraph)
g = graph_from_edgelist(cbind(df$node_grp1, df$node_grp2), directed = FALSE)
gmap$group = components(g)$membership

df$group = gmap$group[ match(df$node_grp1, gmap$node) ]


   grp1 grp2 node_grp1 node_grp2 group
1     1    1         1         5     1
2     1    2         1         6     1
3     1    3         1         7     1
4     2    3         2         7     1
5     2    4         2         8     1
6     2    5         2         9     1
7     3    6         3        10     2
8     3    7         3        11     2
9     3    8         3        12     2
10    4    6         4        10     2
11    4    9         4        13     2
12    4   10         4        14     2

Each unique element of grp1 or grp2 is a node and each row of df is an edge.

Frank
  • 66,179
  • 8
  • 96
  • 180
  • I marked this one as correct as it works on the data provided. My issue (with this response and all other response submitted using igraph) is that they do not work on my real dataset. For the igraph answers, my `newcols` end up being mostly all `NA` causing the `graph_from_edgelist()` to fail. Can you think of a reason this would be? All groupings are non-NA. There are many many unique entries that I'm expecting not to be able to be grouped. Would that be an issue? – Andrew Taylor Jun 08 '16 at 21:25
4

One way to do this is via a matrix that defines links between rows based on group membership.

This approach is related to @Frank's graph answer but uses an adjacency matrix rather than using edges to define the graph. An advantage of this approach is it can deal immediately with many > 2 grouping columns with the same code. (So long as you write the function that determines links flexibly.) A disadvantage is you need to make all pair-wise comparisons between rows to construct the matrix, so for very long vectors it could be slow. As is, @Frank's answer would work better for very long data, or if you only ever have two columns.

The steps are

  1. compare rows based on groups and define these rows as linked (i.e., create a graph)
  2. determine connected components of the graph defined by the links in 1.

You could do 2 a few ways. Below I show a brute force way where you 2a) collapse links, till reaching a stable link structure using matrix multiplication and 2b) convert the link structure to a factor using hclust and cutree. You could also use igraph::clusters on a graph created from the matrix.

1. construct an adjacency matrix (matrix of pairwise links) between rows (i.e., if they in the same group, the matrix entry is 1, otherwise it's 0). First making a helper function that determines whether two rows are linked

linked_rows <- function(data){
  ## helper function
  ## returns a _function_ to compare two rows of data
  ##  based on group membership.

  ## Use Vectorize so it works even on vectors of indices
  Vectorize(function(i, j) {
    ## numeric: 1= i and j have overlapping group membership
    common <- vapply(names(data), function(name)
                     data[i, name] == data[j, name],
                     FUN.VALUE=FALSE)
    as.numeric(any(common))
  })
}

which I use in outer to construct a matrix,

rows <- 1:nrow(df)
A <- outer(rows, rows, linked_rows(df)) 

2a. collapse 2-degree links to 1-degree links. That is, if rows are linked by an intermediate node but not directly linked, lump them in the same group by defining a link between them.

One iteration involves: i) matrix multiply to get the square of A, and ii) set any non-zero entry in the squared matrix to 1 (as if it were a first degree, pairwise link)

## define as a function to use below
lump_links <- function(A) {
  A <- A %*% A
  A[A > 0] <- 1
  A
}

repeat this till the links are stable

oldA <- 0
i <- 0
while (any(oldA != A)) {
  oldA <- A
  A <- lump_links(A)
}

2b. Use the stable link structure in A to define groups (connected components of the graph). You could do this a variety of ways.

One way, is to first define a distance object, then use hclust and cutree. If you think about it, we want to define linked (A[i,j] == 1) as distance 0. So the steps are a) define linked as distance 0 in a dist object, b) construct a tree from the dist object, c) cut the tree at zero height (i.e., zero distance):

df$combinedGrp <- cutree(hclust(as.dist(1 - A)), h = 0)
df

In practice you can encode steps 1 - 2 in a single function that uses the helper lump_links and linked_rows:

lump <- function(df) {
  rows <- 1:nrow(df)
  A <- outer(rows, rows, linked_rows(df))

  oldA <- 0
  while (any(oldA != A)) {
    oldA <- A
    A <- lump_links(A)
  }
  df$combinedGrp <- cutree(hclust(as.dist(1 - A)), h = 0)
  df
}

This works for the original df and also for the structure in @rawr's answer

df <- data.frame(grp1 = c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,6,7,8,9),
                 grp2 = c(1,2,3,3,4,5,6,7,8,6,9,10,11,3,12,3,6,12))
lump(df)

   grp1 grp2 combinedGrp
1     1    1           1
2     1    2           1
3     1    3           1
4     2    3           1
5     2    4           1
6     2    5           1
7     3    6           2
8     3    7           2
9     3    8           2
10    4    6           2
11    4    9           2
12    4   10           2
13    5   11           1
14    5    3           1
15    6   12           3
16    7    3           1
17    8    6           2
18    9   12           3

PS

Here's a version using igraph, which makes the connection with @Frank's answer more clear:

  lump2 <- function(df) {
      rows <- 1:nrow(df)
      A <- outer(rows, rows, linked_rows(df))
      cluster_A <- igraph::clusters(igraph::graph.adjacency(A))
      df$combinedGrp <- cluster_A$membership
      df
    }
jaimedash
  • 2,683
  • 17
  • 30
  • 1
    Yup, in fact there is an algorithm for finding connected components from an adjacency matrix: http://math.stackexchange.com/q/1106870/ I tested my code with the extended example, and it seems to extend to 3+ groups just fine, by the way. – Frank Apr 17 '16 at 01:15
  • 1
    Thanks. Also I edited to clarify, but I meant >2 columns. Your approach would work fine there too, of course, but require some editing of the code that creates the edge lists – jaimedash Apr 17 '16 at 15:51
  • So far this is the only solution that works on my real life dataset. The problem being that it takes far too long. Testing has shown that 100 rows take 5 seconds, 1k rows takes 3 minutes, and 2k rows takes 20 minutes to run. Real dataset currently is 56k rows and it will be getting updated every time a shiny app runs so I was hoping for something much quicker. Looks like I'm just going to have to use 1 of the two grouping methods. Thank you though – Andrew Taylor Jun 08 '16 at 21:20
  • 1
    Hm, I don't know @AndrewTaylor . If you're interested in an igraph solution that works with your real data, you might need to post a new question with new example data to show where these approaches fail. (Oh, meant to reply under my answer.) – Frank Jun 08 '16 at 21:32
2

Hope this solution helps you a bit:

Assumption: df is ordered on the basis of grp1.

## split dataset using values of grp1
split_df <- split.default(df$grp2,df$grp1)

parent <- vector('integer',length(split_df))

## find out which combinations have values of grp2 in common
for (i in seq(1,length(split_df)-1)){
    for (j in seq(i+1,length(split_df))){
        inter <- intersect(split_df[[i]],split_df[[j]])

        if (length(inter) > 0){
            parent[j] <- i
        }
    }
}

ans <- vector('list',length(split_df))

index <- which(parent == 0)

## index contains indices of elements that have no element common
for (i in seq_along(index)){
    ans[[index[i]]] <- rep(i,length(split_df[[i]]))
}

rest_index <- seq(1,length(split_df))[-index]

for (i in rest_index){
    val <- ans[[parent[i]]][1]
    ans[[i]] <- rep(val,length(split_df[[i]]))
}

df$combinedGrp <- unlist(ans)

df

   grp1 grp2 combinedGrp
1     1    1           1
2     1    2           1
3     1    3           1
4     2    3           1
5     2    4           1
6     2    5           1
7     3    6           2
8     3    7           2
9     3    8           2
10    4    6           2
11    4    9           2
12    4   10           2
Kunal Puri
  • 3,419
  • 1
  • 10
  • 22
  • 1
    Fyi, you can use `seq_len(length(split_df)-1)` as the iterator instead of `seq(1,length(split_df)-1)`. That way, you don't end up having `seq(1, 0)` if `split_df` happens to have length of 1 (since that evaluates to `c(1,0)`, annoyingly). – Frank Apr 16 '16 at 02:19
  • 1
    @Frank Thanks for your enlightenment. Though in this case, if split_df's length becomes 1, there is no need for any computation and an if statement can check that thing, Thanks a lot. From now on, I will consider `seq_len` option first while encountering such situations. – Kunal Puri Apr 16 '16 at 02:26
  • Oddly, this method, when tested with 1000 rows of my real dataset created an `unlist(ans)` that as 1005 items low. Can you think of any reason this would happen? – Andrew Taylor Jun 08 '16 at 21:00
0

Based on https://stackoverflow.com/a/35773701/2152245, I used a different implementation of igraph because I already had an adjacency matrix of sf polygons from st_intersects():

library(igraph)
library(sf)
# Use example data
nc <- st_read(system.file("shape/nc.shp", package="sf"))
nc <- nc[-sample(1:nrow(nc),nrow(nc)*.75),] #drop some polygons
# Find intersetions
b <- st_intersects(nc, sparse = F)
g  <- graph.adjacency(b)
clu <- components(g)
gr <- groups(clu)
# Quick loop to assign the groups
for(i in 1:nrow(nc)){
    for(j in 1:length(gr)){
      if(i %in% gr[[j]]){
        nc[i,'group'] <- j
      }
    }
  }
# Make a new sfc object
nc_un <- group_by(nc, group) %>% 
    summarize(BIR74 = mean(BIR74), do_union = TRUE)
plot(nc_un['BIR74'])

enter image description here

Matt
  • 490
  • 8
  • 19