I am working with a dataset of more than 3 million observations. The dataset includes more than 770,000 unique IDs that are of interest to me, along with descriptive information about each ID. The challenge is that these IDs are not unique in the data: many of them appear on multiple rows whose values do not always agree, so I need a way to consolidate the data into one row per ID.
After much thought, I decided to take the mode of each column for each ID, i.e. the most common value of each column within each ID. Taking the most common value lets me collapse the near-duplicate rows into a single row per ID.
The problem: to do this, I have to iterate over 770,000 unique IDs in a for loop. I want the code to be as efficient as possible, because the loop I have been using takes days to complete.
Given the code I have provided, is there a way to optimize it, use parallel processing, or take a different approach that completes the task more efficiently?
Reproducible code:
ID <- c(1,2,2,3,3,3)
x1 <- c("A", "B", "B","C", "C", "C")
x2 <- c("alpha", "bravo", "bravo", "charlie", "charlie2", "charlie2")
x3 <- c("apple", "banana", "banana", "plum1", "plum1", "plum")
df <- data.frame(ID, x1, x2, x3)
#Mode function: returns the most frequent value in a vector
getmode <- function(v) {
  uniqv <- unique(v)
  #tabulate(match(v, uniqv)) counts the occurrences of each unique value
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
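#Sanity check: the most frequent value wins; ties go to the value seen first
getmode(c("A", "B", "B")) # "B"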
library(reshape2)
#Takes the mode of every column and returns them as a one-row data frame
mode_row <- function(dat){
  #Column-wise modes; note apply() coerces the data frame to a character matrix
  x <- setNames(as.data.frame(apply(dat, 2, getmode)), c("value"))
  x$variable <- rownames(x); rownames(x) <- NULL
  #Reshape the long variable/value pairs back into one wide row
  mode_row <- reshape2::dcast(x, . ~ variable, value.var = "value")
  mode_row$. <- NULL
  return(mode_row)
}
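#Example: the three ID == 3 rows collapse into one row of modes
mode_row(subset(df, ID == 3)) # ID "3", x1 "C", x2 "charlie2", x3 "plum1"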
#Take the mode per ID to consolidate duplicate donors
df2 <- NULL
for(i in unique(df$ID)){
  df2 <- rbind(df2, mode_row(subset(df, ID == i)))
  #message(i)
}
df2
Expected Output:
ID x1 x2 x3
1 1 A alpha apple
2 2 B bravo banana
3 3 C charlie2 plum1
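For reference, this is the kind of single-pass grouped approach I am hoping for. I suspect that growing df2 with rbind() on every iteration is part of the slowdown, on top of the repeated per-ID subsetting. The sketch below uses data.table and its lapply(.SD, ...) grouping idiom; I am assuming, without having benchmarked it, that this scales to 770,000 groups:
library(data.table)
#Apply getmode() to every non-grouping column within each ID group, in one pass
dt <- as.data.table(df)
df2 <- dt[, lapply(.SD, getmode), by = ID]
A dplyr equivalent along the lines of group_by(ID) followed by summarise(across(everything(), getmode)) may also work, but I have not tested either approach on the full data.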