2

The data set is a 2 column data set. Column 1 is Original element. Column 2 is a an equivalent sub for the original element. The objective is to create group ID's that group all equivalent parts into the same group ID.

I have thought about writing loop statement to do this but feels like it will adversely affect performance. The original data set to run this will have ~4 million rows of original data.

#Sample data
set.seed(78)
x = data.frame(Original = sample(letters, 10), Sub = sample(letters, 10))


#Sample output is 'Group_ID' column
y = data.frame(Original = x$Original, Sub = x$Sub, Group_ID = c("Group_01", "Group_02", "Group_02", "Group_03", "Group_04", "Group_02", "Group_05", "Group_04", "Group_06", "Group_05"))

Input is object x. Row 1 indicates that 't' and 'w' are equivalent elements and belong in a group. ROw 2 indicates that 'u' and 'o' are equivalent elements and belong in a group and so on...

Output is 'Group_ID' column in y.

Row1: t and w are included in Group_01 (first row, new group) Row2: u and o do not occur in anyt previous groups. New Group_02 is created Row3: 'o' is already part of Group_02 from Row 2. So, 'u', 'o', 'i' are all equivalent and substitutibile to each other. So, Group_02 is reused here and so on...

With this sample data, it can be seen that rows Group_02 is repeated 3 times (row #'s 2,3,6) and Group_05 is repeated 2 times (row #'s 7,10 with 'f' being common element).

Prabha
  • 21
  • 3
  • Look at packages for working with graphs, like `igraph`. Think of your data as a graph, where each `id` is a node, and the 2-column data set is a list of adjacencies, showing which nodes are connected. The groups you want are the connected components of the graph. [Here's an example](https://stackoverflow.com/q/29730624/903061). – Gregor Thomas Apr 01 '19 at 02:38

1 Answers1

2

I second @Gregor's suggestion to identify the connected components of the graph induced by x; here is an example based on your data

library(igraph)
library(dplyr)
x %>%
    left_join(graph_from_data_frame(x) %>%
        components() %>%
        magrittr::extract2(1) %>%
        stack(),
        by = c("Original" = "ind")) %>%
    mutate(values = paste("Group", values, sep = ""))
#   Original Sub values
#1         t   w Group1
#2         u   o Group2
#3         o   i Group2
#4         r   p Group3
#5         s   g Group4
#6         i   z Group2
#7         f   e Group5
#8         g   l Group4
#9         b   a Group6
#10        v   f Group5

Explanation: Construct an igraph from x, determine all connected components and merge the identified group labels with your original data.

Please note that there seems to be a mistake in your expected output; rows 5+8 should have the same group label.

Maurits Evers
  • 49,617
  • 4
  • 47
  • 68
  • Thanks Maurits. This works great and performance is also good for 4M records (~20 seconds). Thanks for pointing out the error in output in example shown. Corrected in question. – Prabha Apr 01 '19 at 03:31