My objective is to identify "connected" rows within a single data frame based on the shared values between two columns in R.
In this example, there are 10 unique segments (i.e., clusters of data), each identified by an integer. Each row represents a pair of segments that were already determined to be within a certain distance threshold of each other. There is no meaningful distinction between the columns "segA" and "segB"; they only keep track of which pairs of segments are connected. The column "dist" gives the distance between the pair of segments, but it is not needed at this point, since the data frame contains only the pairs that are deemed "connected."
I'm trying to figure out a way of identifying all of the rows which have at least one shared value in "segA" or "segB", indicating a connected segment between rows.
My initial attempts have been convoluted for loops and logical statements (I'm new to R programming), so I would greatly appreciate any concise solutions!
Example:
df <- data.frame(
  segA = c(1, 1, 2, 4, 6, 7, 9),
  segB = c(2, 3, 4, 5, 8, 8, 10),
  dist = c(0.5321, 0.3212, 0.4351, 0.1421, 0.5125, 0.1692, 0.3218)
)
df
  segA segB   dist
1    1    2 0.5321
2    1    3 0.3212
3    2    4 0.4351
4    4    5 0.1421
5    6    8 0.5125
6    7    8 0.1692
7    9   10 0.3218
Rows 1 and 2 are connected because they both contain segment "1".
Rows 3 and 1 are connected because they both contain segment "2", etc.
Even though rows 2 and 3 don't share a segment directly, they are still connected overall through their mutual connection to row 1.
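To make the "direct" connections concrete, here is a minimal base-R sketch of the check I have in mind. The helper name `shares_segment` and the matrix `direct` are just illustrative names I made up, and the pairwise comparison is only meant to show the idea on this small example, not to be an efficient solution.

```r
df <- data.frame(
  segA = c(1, 1, 2, 4, 6, 7, 9),
  segB = c(2, 3, 4, 5, 8, 8, 10)
)

# Hypothetical helper: TRUE when rows i and j have at least one segment in common
shares_segment <- function(i, j) {
  length(intersect(c(df$segA[i], df$segB[i]),
                   c(df$segA[j], df$segB[j]))) > 0
}

n <- nrow(df)

# Logical n x n matrix of direct row-to-row connections
direct <- outer(seq_len(n), seq_len(n), Vectorize(shares_segment))
diag(direct) <- FALSE  # a row trivially shares segments with itself; ignore that

direct[1, 2]  # rows 1 and 2 share segment 1
direct[2, 3]  # rows 2 and 3 share nothing directly
```

This only captures direct links; it does not yet follow chains such as row 2 to row 1 to row 3.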
The desired final output would be something like:
(1) = 1, 2, 3, 4, 5
(2) = 6, 7, 8
(3) = 9, 10
where (1), (2), and (3) are the distinct overall groups, each listing the segments that are directly or indirectly connected.
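As far as I can tell, this amounts to finding the connected components of a graph whose nodes are the segments and whose edges are the rows. Below is a rough base-R sketch of that idea, assuming a simple "merge the two groups whenever a row links them" approach. `group_segments` is a hypothetical helper name I chose, not an existing function, and I have not checked how well it scales to larger data.

```r
df <- data.frame(
  segA = c(1, 1, 2, 4, 6, 7, 9),
  segB = c(2, 3, 4, 5, 8, 8, 10)
)

# Hypothetical helper: group segments that are linked directly or through a chain of pairs
group_segments <- function(a, b) {
  segs  <- sort(unique(c(a, b)))
  group <- seq_along(segs)   # every segment starts in its own group
  ia <- match(a, segs)       # position of each segA value in segs
  ib <- match(b, segs)       # position of each segB value in segs

  for (k in seq_along(ia)) {
    ga <- group[ia[k]]
    gb <- group[ib[k]]
    if (ga != gb) {
      # This pair links two groups: relabel the higher-numbered group as the lower one
      group[group == max(ga, gb)] <- min(ga, gb)
    }
  }

  # Renumber the groups 1, 2, 3, ... in order of appearance and list their segments
  split(segs, match(group, unique(group)))
}

components <- group_segments(df$segA, df$segB)
components
```

On the example data this should give three groups matching the desired output above. If an extra dependency is acceptable, a graph package would presumably handle the same task, but I have kept the sketch to base R.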