
I am working with a dataset of more than 3 million observations. The data set covers more than 770,000 unique IDs of interest, along with descriptive information about each ID. The challenge is that the same ID can appear on multiple rows, so I need a way to consolidate the data.

After much thinking, I decided to take the mode of each column for each ID in the data set. The output gives me the most common value in each column for each ID, which lets me consolidate the duplicate rows into one row per ID.

The problem: to do so, I have to iterate over 770,000 unique IDs in a for loop. I want the code to be as efficient as possible, because the for loop I have been using takes days to complete.

Given the code I have provided, is there a way to optimize it, use parallel processing, or otherwise complete the task more efficiently?

Reproducible code:

ID <- c(1,2,2,3,3,3)
x1 <- c("A", "B", "B","C", "C", "C")
x2 <- c("alpha", "bravo", "bravo", "charlie", "charlie2", "charlie2")
x3 <- c("apple", "banana", "banana", "plum1", "plum1", "plum")

df <- data.frame(ID, x1, x2, x3)

#Mode Function
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
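(An editorial aside, not from the original post: `which.max` returns the index of the *first* maximum, so when counts tie, `getmode` returns whichever value appears first in the vector. A quick sanity check:)

```r
# Mode helper as defined above; ties go to the value seen first
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

getmode(c("A", "B", "B"))             # "B"     (clear majority)
getmode(c("alpha", "bravo"))          # "alpha" (tie -> first appearance)
getmode(c("plum1", "plum", "plum1"))  # "plum1"
```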

library(reshape2)

#Takes the mode for every column
mode_row <- function(dat){
  x <- setNames(as.data.frame(apply(dat, 2, getmode)), c("value"))
  x$variable <- rownames(x); rownames(x) <- NULL
  mode_row <- reshape2::dcast(x, . ~ variable, value.var = "value")
  mode_row$. <- NULL
  return(mode_row)
}

#Take the mode for each ID to consolidate duplicate rows
df2 <- NULL
for(i in unique(df$ID)){
  df2 <- rbind(df2, mode_row(subset(df, ID == i)))
  #message(i)

}

df2

Expected Output:

  ID x1       x2     x3
1  1  A    alpha  apple
2  2  B    bravo banana
3  3  C charlie2  plum1
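(Editorial note: the main cost in the loop above is growing `df2` with `rbind` on every iteration, which re-copies the accumulated result each time and makes the loop quadratic. A minimal sketch of a bind-once variant, using a simplified `mode_row` that avoids the reshape2 round-trip, could look like this:)

```r
# Rebuild the example data (stringsAsFactors = FALSE so modes come back as character)
df <- data.frame(
  ID = c(1, 2, 2, 3, 3, 3),
  x1 = c("A", "B", "B", "C", "C", "C"),
  x2 = c("alpha", "bravo", "bravo", "charlie", "charlie2", "charlie2"),
  x3 = c("apple", "banana", "banana", "plum1", "plum1", "plum"),
  stringsAsFactors = FALSE
)

getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Mode of every column of a sub-data.frame, without reshape2
mode_row <- function(dat) as.data.frame(lapply(dat, getmode), stringsAsFactors = FALSE)

# Split once by ID, compute once per group, bind all results in a single call
df2 <- do.call(rbind, lapply(split(df, df$ID), mode_row))
rownames(df2) <- NULL
df2
#   ID x1       x2     x3
# 1  1  A    alpha  apple
# 2  2  B    bravo banana
# 3  3  C charlie2  plum1
```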

Sharif Amlani
  • My immediate reaction is "don't use a `for` loop". Use grouped processing (for instance in base R or the `tidyverse`) to get the same results. That will be much more efficient and give more compact and readable code. Please post your expected output. – Limey Jun 05 '20 at 07:29

1 Answer


There are grouped functions available in base R, dplyr and data.table:

Base R:

aggregate(.~ID, df, getmode)

#  ID x1       x2     x3
#1  1  A    alpha  apple
#2  2  B    bravo banana
#3  3  C charlie2  plum1

dplyr:

library(dplyr)
df %>% group_by(ID) %>% summarise(across(x1:x3, getmode))
#Use summarise_at in older version of dplyr
#df %>% group_by(ID) %>% summarise_at(vars(x1:x3), getmode)

data.table:

library(data.table)
setDT(df)[, lapply(.SD, getmode), ID, .SDcols = x1:x3]
Ronak Shah
  • Y'all are the best. Here I was waiting all day (and last night) for the for loop to finish and you come through with the aggregate function that works in less than a second and gets the job done lol Thanks a million!! – Sharif Amlani Jun 05 '20 at 07:35
  • The aggregate function is producing numeric variables when applied to the data, and on my larger data frame it only produces the first 9 rows? – Sharif Amlani Jun 05 '20 at 07:42
  • Numeric variables may be because you have data as factors? Convert to characters and try again? – Ronak Shah Jun 05 '20 at 07:47
  • Nope, I converted to characters and still got numeric values sadly. However, the data.table example worked perfectly on my real dataset, so I'm incredibly grateful! – Sharif Amlani Jun 05 '20 at 07:53
  • Probably this would work then `aggregate(cbind(x1, x2, x3)~ID, df, getmode)` – Ronak Shah Jun 05 '20 at 07:54
  • Yup! That worked. However, while it worked on the example, it produces only the first 9 rows in my real dataset. The data.table example worked perfectly on my real dataset, so I got what I need, but it's a bit strange that aggregate didn't work there. – Sharif Amlani Jun 05 '20 at 07:58
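(Editorial note: one plausible explanation for the truncated `aggregate` result, untested against the real data so treat it as a guess: the formula method of `aggregate` silently drops any row containing an `NA` via its default `na.action = na.omit`, so IDs whose descriptive columns contain missing values disappear from the output. Passing `na.action = NULL` keeps them:)

```r
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Toy data with an NA in a descriptive column
df_na <- data.frame(
  ID = c(1, 1, 2),
  x1 = c("A", "A", NA),
  stringsAsFactors = FALSE
)

aggregate(x1 ~ ID, df_na, getmode)                    # ID 2 is silently dropped (na.omit)
aggregate(x1 ~ ID, df_na, getmode, na.action = NULL)  # both IDs kept
```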