The data set has two columns. Column 1 is the original element. Column 2 is an equivalent substitute for the original element. The objective is to create group IDs so that all equivalent parts end up in the same group.
I have thought about writing a loop to do this, but I suspect it will perform poorly. The real data set has ~4 million rows.
#Sample data
set.seed(78)
x = data.frame(Original = sample(letters, 10), Sub = sample(letters, 10))
#Sample output is 'Group_ID' column
y = data.frame(Original = x$Original, Sub = x$Sub, Group_ID = c("Group_01", "Group_02", "Group_02", "Group_03", "Group_04", "Group_02", "Group_05", "Group_04", "Group_06", "Group_05"))
Input is object x. Row 1 indicates that 't' and 'w' are equivalent elements and belong in one group. Row 2 indicates that 'u' and 'o' are equivalent elements and belong in one group, and so on.
Output is 'Group_ID' column in y.
Row 1: 't' and 'w' are placed in Group_01 (first row, new group).
Row 2: 'u' and 'o' do not occur in any previous group, so a new Group_02 is created.
Row 3: 'o' is already part of Group_02 from row 2, so 'u', 'o' and 'i' are all equivalent and substitutable for each other. Group_02 is reused here, and so on.
With this sample data, Group_02 appears 3 times (rows 2, 3, 6) and Group_05 appears 2 times (rows 7, 10, with 'f' being the common element).
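To make the rule above concrete, here is a minimal base-R sketch of how I read it: each row is a link between two elements, and a group is a connected set of linked elements, numbered in order of first appearance. The function name `assign_groups` and the small hand-made data frame `x_demo` are only illustrative (they are not part of my real data), and this is still the loop-based approach I am worried about, so I have not checked how it behaves at ~4 million rows.

```r
# Hypothetical helper: label each row with the connected group of its pair.
# Uses a simple union-find (disjoint-set) structure over the distinct elements.
assign_groups <- function(original, sub) {
  original <- as.character(original)
  sub <- as.character(sub)

  # Map every distinct element to an integer index
  elems <- unique(c(original, sub))
  a <- match(original, elems)
  b <- match(sub, elems)

  # parent[i] == i means element i is the root of its group
  parent <- seq_along(elems)

  # Follow parent links until the root of element i is reached
  find_root <- function(i) {
    while (parent[i] != i) i <- parent[i]
    i
  }

  # Each row links two elements: merge their groups, keeping the smaller root
  for (k in seq_along(a)) {
    ra <- find_root(a[k])
    rb <- find_root(b[k])
    if (ra != rb) parent[max(ra, rb)] <- min(ra, rb)
  }

  # Root of each row's group, renumbered by first appearance in row order
  roots <- vapply(a, find_root, integer(1))
  sprintf("Group_%02d", match(roots, unique(roots)))
}

# Hand-made example (not the set.seed(78) sample above)
x_demo <- data.frame(Original = c("t", "u", "o", "a", "b", "u"),
                     Sub      = c("w", "o", "i", "c", "d", "z"))
x_demo$Group_ID <- assign_groups(x_demo$Original, x_demo$Sub)
print(x_demo)
# Group_ID: Group_01, Group_02, Group_02, Group_03, Group_04, Group_02
```

One consequence of this reading: if a later row links two groups that were created separately in earlier rows, all of those rows end up with the same Group_ID, which a single top-to-bottom pass that only reuses the first matching group would miss.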