What you're asking for is called imputation. I'll demonstrate using mtcars
as the data, cyl
as the grouping variable (your room_type
, I suspect), and other columns with NA
values.
mt <- mtcars
set.seed(42)
mt$disp[sample(32,10)] <- NA
mt$hp[sample(32,10)] <- NA
head(mt)
# mpg cyl disp hp drat wt qsec vs am gear carb
# Mazda RX4 21.0 6 NA 110 3.90 2.620 16.46 0 1 4 4
# Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
# Datsun 710 22.8 4 108 NA 3.85 2.320 18.61 1 1 4 1
# Hornet 4 Drive 21.4 6 NA NA 3.08 3.215 19.44 1 0 3 1
# Hornet Sportabout 18.7 8 NA NA 3.15 3.440 17.02 0 0 3 2
# Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
From here:
dplyr
library(dplyr)
set.seed(42)
mt %>%
group_by(cyl) %>%
mutate(across(c(disp, hp), ~ coalesce(., sample(na.omit(.), size=n(), replace=TRUE)))) %>%
ungroup()
# # A tibble: 32 × 11
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
# 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
# 3 22.8 4 108 91 3.85 2.32 18.6 1 1 4 1
# 4 21.4 6 145 110 3.08 3.22 19.4 1 0 3 1
# 5 18.7 8 301 180 3.15 3.44 17.0 0 0 3 2
# 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
# 7 14.3 8 301 245 3.21 3.57 15.8 0 0 3 4
# 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
# 9 22.8 4 141. 52 3.92 3.15 22.9 1 0 4 2
# 10 19.2 6 225 123 3.92 3.44 18.3 1 0 4 4
# # … with 22 more rows
data.table
library(data.table)
cols <- c("disp", "hp")
set.seed(42)
as.data.table(mt)[, c(cols) := lapply(.SD, \(z) fcoalesce(z, sample(na.omit(z), size=.N, replace=TRUE))), .SDcols = cols][] |>
head()
>
# mpg cyl disp hp drat wt qsec vs am gear carb
# <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
# 1: 21.0 6 79.0 110 3.90 2.620 16.46 0 1 4 4
# 2: 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
# 3: 22.8 4 108.0 113 3.85 2.320 18.61 1 1 4 1
# 4: 21.4 6 460.0 97 3.08 3.215 19.44 1 0 3 1
# 5: 18.7 8 146.7 62 3.15 3.440 17.02 0 0 3 2
# 6: 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
base R
set.seed(42)
mt[cols] <- lapply(mt[cols], \(z) ave(z, mt$cyl, FUN = \(z) ifelse(is.na(z), sample(na.omit(z), size=length(z), replace=TRUE), z)))
head(mt)
# mpg cyl disp hp drat wt qsec vs am gear carb
# Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
# Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
# Datsun 710 22.8 4 108 91 3.85 2.320 18.61 1 1 4 1
# Hornet 4 Drive 21.4 6 145 110 3.08 3.215 19.44 1 0 3 1
# Hornet Sportabout 18.7 8 301 180 3.15 3.440 17.02 0 0 3 2
# Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Note: while I set the random seed for imputation in each dialect, there is no expectation that the order of columns and fixes will be the same between the dialects. For this reason we see that the replacement for the NA
values is not the same between the dialects of code; the seed is provided for basic reproducibility, not identical results.