0

I ran 5 imputations on a data set with missing values. For my purposes, I want to replace missing values with the mode from the 5 imputations. Let's say I have the following data sets, where df is my original data, ID is a grouping variable to identify each case, and imp is my imputed data:

df <- data.frame(ID = c(1,2,3,4,5), 
                 var1 = c(1,NA,3,6,NA),
                 var2 = c(NA,1,2,6,6),
                 var3 = c(NA,2,NA,4,3))

imp <- data.frame(ID = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,5), 
                 var1 = c(1,2,3,3,2,5,4,5,6,6,7,2,3,2,5,6,5,6,6,6,3,1,2,3,2),
                 var2 = c(4,3,2,3,2,4,6,5,4,4,7,2,4,2,3,6,5,6,4,5,3,3,4,3,2),
                 var3 = c(7,6,5,6,6,2,3,2,4,2,5,4,5,3,5,1,2,1,3,2,1,2,1,1,1))

I have a method that works, but it involves a ton of manual coding as I have ~200 variables total (I'm doing this on 3 different data sets with different variables). My code looks like this for one variable:

library(dplyr)

mode <- function(codes){
  which.max(tabulate(codes))
}

var1 <- imp %>% group_by(ID) %>% summarise(var1 = mode(var1))

df3 <- df %>% 
  left_join(var1, by = "ID") %>% 
  mutate(var1 = coalesce(var1.x, var1.y)) %>% 
  select(-var1.x, -var1.y)

Thus, the original value in df is replaced with the mode only if the value was NA.

It is taking forever to keep manually coding this for every variable. I'm hoping there is an easier way of calculating the mode from the imputed data set for each variable by ID and then replacing the NAs with that mode in the original data. I thought maybe I could put the variable names in a vector and somehow iterate through them with one code where i changes to each variable name, but I didn't know where to go with that idea.

x <- colnames(df)

# Attempting to iterate through variables names using i
i = as.factor(x[[2]])

This is where I am stuck. Any help is much appreciated!

BLS
  • 59
  • 6
  • `df %>% left_join(var1, by = "ID") %>% mutate(var1 = coalesce(var1.x, var1.y))` returns an error `Error: Problem with `mutate()` column `var1`. ℹ `var1 = coalesce(var1.x, var1.y)`.x Can't combine `..1` and `..2` .` Can you fix it? – Peace Wang Jan 04 '22 at 03:15
  • I've run this code a few times and checked that I didn't have other packages loaded that might be interfering, and I am not receiving any errors. I wonder if anyone else is having an issue with this code? – BLS Jan 04 '22 at 03:27
  • That is because you do have a function called `mode` that returns a numeric value yet this function is not defined. You need to post the function too. Also, you rather call it `Mode` instead of `mode` since `mode` refers to the data type storage. eg check `?base::mode`. – Onyambu Jan 04 '22 at 03:31
  • @PeaceWang You are calling `mode` function from base R yet OP is calling a user defined function. – Onyambu Jan 04 '22 at 03:32
  • @Onyambu Clear. Thanks. – Peace Wang Jan 04 '22 at 03:36
  • Thank you - I completely forget about that line as I've been playing around with the code and it's looking pretty messy...I added the function to the post. – BLS Jan 04 '22 at 03:47

3 Answers3

1

Here is one option using tidyverse. Essentially, we can pivot both dataframes long, then join together and coalesce in one step rather than column by column. Mode function taken from here.

library(tidyverse)

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

imp_long <- imp %>%
  group_by(ID) %>%
  summarise(across(everything(), Mode)) %>%
  pivot_longer(-ID)

df %>%
  pivot_longer(-ID) %>%
  left_join(imp_long, by = c("ID", "name")) %>%
  mutate(var1 = coalesce(value.x, value.y)) %>%
  select(-c(value.x, value.y)) %>%
  pivot_wider(names_from = "name", values_from = "var1")

Output

# A tibble: 5 × 4
     ID  var1  var2  var3
  <dbl> <dbl> <dbl> <dbl>
1     1     1     3     6
2     2     5     1     2
3     3     3     2     5
4     4     6     6     4
5     5     3     6     3
AndrewGB
  • 16,126
  • 5
  • 18
  • 49
0

You can use -

library(dplyr)

mode_data <- imp %>% 
  group_by(ID) %>% 
  summarise(across(starts_with('var'), Mode))

df %>%
  left_join(mode_data, by = 'ID') %>%
  transmute(ID, 
            across(matches('\\.x$'), 
            function(x) coalesce(x, .[[sub('x$', 'y', cur_column())]]), 
            .names = '{sub(".x$", "", .col)}'))

#  ID var1 var2 var3
#1  1    1    3    6
#2  2    5    1    2
#3  3    3    2    5
#4  4    6    6    4
#5  5    3    6    3
  • mode_data has Mode value for each of the var columns.
  • Join df and mode_data by ID.
  • Since all the pairs have name.x and name.y in their name, we can take all the name.x pairs replace x with y to get corresponding pair of columns. (.[[sub('x$', 'y', cur_column())]])
  • Use coalesce to select the non-NA value in each pair.
  • Change the column name by removing .x from the name. ({sub(".x$", "", .col)}) so var1.x becomes only var1.

where Mode function is taken from here

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
0
library(dplyr, warn.conflicts = FALSE)

imp %>% 
  group_by(ID) %>% 
  summarise(across(everything(), Mode)) %>% 
  bind_rows(df) %>% 
  group_by(ID) %>% 
  summarise(across(everything(), ~ coalesce(last(.x), first(.x))))
#> # A tibble: 5 × 4
#>      ID  var1  var2  var3
#>   <dbl> <dbl> <dbl> <dbl>
#> 1     1     1     3     6
#> 2     2     5     1     2
#> 3     3     3     2     5
#> 4     4     6     6     4
#> 5     5     3     6     3

Created on 2022-01-03 by the reprex package (v2.0.1)

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
IceCreamToucan
  • 28,083
  • 2
  • 22
  • 38