Replace data entry errors with the most common value - dplyr

Question

I have a data frame which contains some data entry errors.

I wish to replace these outlier values per group with the most common value per group.

My data looks as follows:

df <- data.frame(CODE = c("J1745","J1745","J1745","J1745","J1100","J1100","J1100","J1100","J1100","J1100"),NDC = c(1234,1234,1234,1234,5678,5678,5678,5678,5678,5678),DOSAGE = c("10ML","10 ML","10 ML","10 ML","5 ML","5 ML","5 ML","5 ML","50 ML","5 ML"),DESC = c("TEXT1","TEXT 1","TEXT 1","TEXT 1","TEXT 2","TEXT 2","TEXT 2","TEXT 2","TEXT 10","TEXT 2"))

As you can see my DOSAGE and DESC columns contain some inconsistencies and I would like to replace them with the most common value within each group.

My desired output looks as follows:

Why do you want to replace with the most common value? it is not a good practice. I think it would be better to try to use some regular expression to try to replace them with their correct value, and not just imputing by the most frequent category. — Tomas Capretto, Mar 22 '21 at 01:43
Can you please post your input and output examples as text (code blocks) rather than screenshots? It's more accessible ... — Ben Bolker, Mar 22 '21 at 01:52
You accepted @RonakShah's answer, but see my comment there for why this might not be what you want.. Can you please clarify what you mean by/how you would define "outlier values" ? — Ben Bolker, Mar 22 '21 at 17:14

score 3 · Answer 1 · answered Mar 22 '21 at 01:51

I agree with the comment that this is potentially dangerous.

The code below replaces elements that have <= a specified number of occurrences with the most common value. I use base-R machinery within the replacement function because that's what I know how to do.

repl_common <- function(x,n=1) {
    tt <- tapply(x,x,length)          ## count number of instances
    m <- names(tt)[which.max(tt)]     ## find mode
    x[tt[as.character(x)]<=n] <- m    ## replace
    return(x)
}
## apply by group across specified columns
df %>% group_by(CODE) %>% mutate(across(c(DOSAGE,DESC), repl_common))

Replacing the first line of the function with a call to `tabulate()` might be slightly cleaner. — Ben Bolker, Mar 22 '21 at 17:15

score 2 · Accepted Answer · answered Mar 22 '21 at 05:11

You can use the Mode function from here to get the most common value.

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

Apply this function by group.

library(dplyr)
df %>% group_by(CODE, NDC) %>% mutate(across(c(DOSAGE, DESC), Mode)) %>% ungroup

#    CODE    NDC DOSAGE DESC  
#   <chr> <dbl> <chr>  <chr> 
# 1 J1745  1234 10 ML  TEXT 1
# 2 J1745  1234 10 ML  TEXT 1
# 3 J1745  1234 10 ML  TEXT 1
# 4 J1745  1234 10 ML  TEXT 1
# 5 J1100  5678 5 ML   TEXT 2
# 6 J1100  5678 5 ML   TEXT 2
# 7 J1100  5678 5 ML   TEXT 2
# 8 J1100  5678 5 ML   TEXT 2
# 9 J1100  5678 5 ML   TEXT 2
#10 J1100  5678 5 ML   TEXT 2

This does match the desired output, but it doesn't do what the OP said they wanted. This replaces **all** of the values in a group with the mode. The OP said they wanted to replace "outlier values" (not clearly defined) with the mode. — Ben Bolker, Mar 22 '21 at 17:13

Replace data entry errors with the most common value - dplyr

2 Answers2