R: Identifying Data Frame Rows Connected By Shared Values In Two Columns

Question

My objective is to identify "connected" rows within a single data frame based on the shared values between two columns in R.

In this example, there are 10 unique segments (i.e., clusters of data), each identified by an integer. Each row represents two segments that were already determined to be within a certain distance threshold of each other. There is no significant distinction between the columns "segA" and "segB"; they are just used to keep track of the pairs of segments which are connected. The column "dist" holds the distance between the pair of segments, but it is not really needed at this point, as the data frame only contains those pairs of segments which are deemed "connected."

I'm trying to figure out a way of identifying all of the rows which have at least one shared value in "segA" or "segB", indicating a connected segment between rows.

My initial attempts have been convoluted for loops and logical statements (I'm new to R programming), so I would greatly appreciate any concise solutions!

Example:

 df = data.frame(
  segA = c(1, 1, 2, 4, 6, 7, 9),
  segB = c(2, 3, 4, 5, 8, 8, 10),
  dist = c(0.5321, 0.3212, 0.4351, 0.1421, 0.5125, 0.1692, 0.3218)
 )

df
  segA segB   dist
1    1    2 0.5321
2    1    3 0.3212
3    2    4 0.4351
4    4    5 0.1421
5    6    8 0.5125
6    7    8 0.1692
7    9   10 0.3218

Rows 1 and 2 are connected because they both contain segment "1".

Rows 3 and 1 are connected because they both contain segment "2", etc.

Even though rows 2 and 3 aren't directly connected by a shared segment, they are connected overall through their mutual connection to row 1.

The desired final output would be something like:

(1) = 1, 2, 3, 4, 5  
(2) = 6, 7, 8  
(3) = 9, 10

where (1), (2), and (3) represent the distinct overall segments and their components which are directly/mutually connected.

It appears you have a connected network problem. I have never used it, but maybe the igraph package or something similar would be useful. — Dave2e, May 12 '16 at 03:03
Sometimes knowing the name of the problem is half the battle. Thank you for that. — Gerald, May 12 '16 at 03:48
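
For readers following up on the igraph suggestion above, here is a minimal sketch of how it might look. It assumes the igraph package is installed and treats each row as an undirected edge between segA and segB. The names `g`, `memb`, `seg_groups`, and `row_groups` are made up for illustration and do not come from the original thread.

```r
library(igraph)

df <- data.frame(
  segA = c(1, 1, 2, 4, 6, 7, 9),
  segB = c(2, 3, 4, 5, 8, 8, 10)
)

# Build an undirected graph: one vertex per segment, one edge per row
g <- graph_from_data_frame(df[, c("segA", "segB")], directed = FALSE)

# Component id for every vertex, named by segment label (as character)
memb <- components(g)$membership

# Segments grouped by component, sorted within each group
seg_groups <- lapply(split(as.numeric(names(memb)), memb), sort)

# Rows grouped by the component of their segA value
# (segB is always in the same component as segA)
row_groups <- split(seq_len(nrow(df)), memb[as.character(df$segA)])
```

With the example data, `seg_groups` should contain three groups of segments and `row_groups` the matching groups of row indices.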

bgoldst · Accepted Answer · 2016-05-12T03:14:28.497

## helper function for merging vector elements of a list
merge.elems <- function(x,i,j) {
    c(
        x[seq_len(i-1L)], ## before i
        list(unique(c(x[[i]],x[[j]]))), ## combined i,j
        x[seq_len(j-i-1L)+i], ## between i,j
        x[seq_len(length(x)-j)+j] ## after j
    );
}; ## end merge.elems()

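## initialize row groups and value groups
## rgs: one group per row index; vgs: the (segA,segB) pair held by each row
rgs <- as.list(seq_len(nrow(df)));
vgs <- do.call(Map,c(c,unname(df[1:2]))); ## same as Map(c,df$segA,df$segB)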

## if there are 2 or more groups, exhaustively merge overlapping value group pairs
if (length(rgs)>1L) {
    i <- 1L;
    j <- 2L;
    repeat {
        if (any(vgs[[i]]%in%vgs[[j]])) {
            rgs <- merge.elems(rgs,i,j);
            vgs <- merge.elems(vgs,i,j);
            j <- i+1L;
            if (j>length(rgs)) break;
        } else {
            j <- j+1L;
            if (j>length(rgs)) {
                i <- i+1L;
                if (i==length(rgs)) break;
                j <- i+1L;
            }; ## end if
        }; ## end if
    }; ## end repeat
}; ## end if

## results
rgs;
## [[1]]
## [1] 1 2 3 4
##
## [[2]]
## [1] 5 6
##
## [[3]]
## [1] 7
##
vgs;
## [[1]]
## [1] 1 2 3 4 5
##
## [[2]]
## [1] 6 8 7
##
## [[3]]
## [1]  9 10
##
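
As a possible alternative to the pairwise merging above, the same grouping can be sketched with a small union-find (disjoint-set) structure in base R. This is not the answer's method, just one way to get equivalent output without extra packages. `find_groups` and `find_root` are hypothetical names introduced here.

```r
df <- data.frame(
  segA = c(1, 1, 2, 4, 6, 7, 9),
  segB = c(2, 3, 4, 5, 8, 8, 10)
)

## hypothetical helper: label connected components with a union-find
find_groups <- function(a, b) {
  ids <- sort(unique(c(a, b)))      # distinct segment labels
  parent <- seq_along(ids)          # each segment starts as its own root

  ## follow parent links until reaching a root
  find_root <- function(k) {
    while (parent[k] != k) k <- parent[k]
    k
  }

  ia <- match(a, ids)               # position of each segA value in ids
  ib <- match(b, ids)               # position of each segB value in ids
  for (r in seq_along(ia)) {
    ra <- find_root(ia[r])
    rb <- find_root(ib[r])
    ## union: the smaller index becomes the root of the merged set
    if (ra != rb) parent[max(ra, rb)] <- min(ra, rb)
  }

  roots <- vapply(seq_along(ids), find_root, integer(1))
  list(
    segments = unname(split(ids, roots)),              # like vgs
    rows     = unname(split(seq_along(a), roots[ia]))  # like rgs
  )
}

res <- find_groups(df$segA, df$segB)
res$segments
res$rows
```

For the example data, `res$rows` should give the same three row groups as `rgs`, and `res$segments` the same segment groups as `vgs`, sorted within each group.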
