
I have a data set in which I would like to impute one value from the others, based on the probability distribution of those values. Let me build a reproducible example first:

library(tidyverse)
library(janitor)

dummy1 <- runif(5000, 0, 1)
dummy11 <- case_when(
    dummy1 < 0.776 ~ 1,
    dummy1 < 0.776 + 0.124 ~ 2,
    TRUE ~ 5)

df1 <- tibble(q1 = dummy11)

Here is the output:

df1 %>% tabyl(q1)
 q1    n percent
  1 3888  0.7776
  2  605  0.1210
  5  507  0.1014

I used mutate and sample to redistribute the value 5 among the values 1 and 2, like this:

df1 %>%
    mutate(q1 = case_when(q1 == 5 ~ sample(
        2,
        length(q1),
        prob = c(0.7776, 0.1210),
        replace = TRUE
    ),
    TRUE ~ as.integer(q1))
    )

And here is the result:

q1    n percent
  1 4322  0.8644
  2  678  0.1356
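
As a sanity check on these figures: `sample()` rescales `prob` so that it sums to 1, so the 5s are split between 1 and 2 in the ratio 0.7776 : 0.1210. The sketch below works out the shares one would expect afterwards, using only the `tabyl` percentages shown above; the names `p`, `w` and `expected` are made up for this illustration. It comes out at roughly 0.865 and 0.135, close to the observed 0.8644 and 0.1356.

```r
# Observed shares of the values 1, 2 and 5, taken from the tabyl output above
p <- c(`1` = 0.7776, `2` = 0.1210, `5` = 0.1014)

# sample() normalises `prob` internally; do the same by hand
w <- p[c("1", "2")] / sum(p[c("1", "2")])

# Expected shares once the 5s have been redistributed
expected <- p[c("1", "2")] + p[["5"]] * w
round(expected, 4)
```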

This approach seems to work. However, since I need to apply it to several variables, I tried to write a function that works with the tidyverse and tidy evaluation, like this:

    my_impute <- function(.data, .prob_var, ...) {
        .prob_var <- enquo(.prob_var)

        .data %>%
            sample(2, prob=c(!!.prob_var), replace = TRUE) 
    }

# running on data 
df1 %>%
    mutate(q1 = case_when(q1 == 5 ~ !!my_impute(q1),
    TRUE ~ as.integer(q1))
    )

The error is:

Error in eval_tidy(pair$lhs, env = default_env) : object 'q1' not found
DanG
  • In the last part, you are passing the `prob` from the original dataset, where 'q1' is `integer`, whereas outside the function it seems to be based on the `tabyl` output – akrun Oct 16 '19 at 17:33
  • Sorry @akrun, I do not follow. q1 is either 1, 2 or 5. Ah, yeah, I used tabyl to show the output – DanG Oct 16 '19 at 17:45
  • What I meant is that in the code outside the function, you pass `prob` as `prob = c(0.7776, 0.1210)` which is from `tabyl` output. Inside the function, it is just passing the column 'q1', and not the prob values – akrun Oct 16 '19 at 17:47

2 Answers


We need the prob values from the 'percent' column generated by tabyl, so the function can be modified to:

library(janitor)
library(dplyr)

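my_impute <- function(.data, .prob_var, vals, ...) {
    .prob_var <- enquo(.prob_var)

    # Relative frequencies of the values listed in `vals`
    .prob_vals <- .data %>%
        janitor::tabyl(!!.prob_var) %>%
        filter(!!.prob_var %in% vals) %>%
        pull(percent)

    # Replace every 5 with a draw from 1:2, weighted by those frequencies
    .data %>%
        mutate(!!.prob_var := case_when(
            !!.prob_var == 5 ~ sample(
                2,
                n(),
                prob = .prob_vals,
                replace = TRUE
            ),
            # use the captured column rather than a hard-coded q1
            TRUE ~ as.integer(!!.prob_var)
        ))
}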


df1 %>% 
     my_impute(q1, vals = 1:2) %>%
     tabyl(q1)
# q1    n percent
# 1 4285   0.857
# 2  715   0.143
akrun
  • Thanks @akrun, can you explain the 1:2 here: `filter(!!.prob_var %in% 1:2)`? What if we have the values 1, 3 and 5 (instead of 1, 2 and 5), or more than three values, say 1, 3, 4, 5, 6? For example, I changed the input to contain only 1, 3 and 5 and replaced the filter with `filter(!!.prob_var %in% c(1,3))`, and it returns 1, 2, 3 – DanG Oct 16 '19 at 18:23
  • @DanielG In that case, you can pass that as another argument to the function. Here, the `filter` keeps the rows where the value in 'q1' is 1 or 2. I changed the function to take 'vals' as another vector of values – akrun Oct 16 '19 at 18:27
  • I am just trying to figure it out, hope you don't mind. If we have `dummy11 <- case_when( dummy1 < 0.776 ~ 1, dummy1 < 0.776 + 0.124 ~ 3, TRUE ~ 5)`, then with the updated function I pass vals like `df1 %>% my_impute(q1, vals = c(1,3)) %>% tabyl(q1)`, but the output contains the values 1, 2 and 3 – DanG Oct 16 '19 at 18:42
  • @DanielG That is a different function, where you are selecting only a subset of values in each step. In that case, you may need to change the `case_when` inside the function – akrun Oct 16 '19 at 18:56
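
The 1, 2 and 3 in that output come from `sample(2, ...)`, which always draws from 1:2 whatever `vals` is. One possible way to follow the suggestion in the comments is sketched below. It is not the answer's function: `my_impute2`, its `target` argument and the test tibble `df3` are hypothetical names introduced here, and the frequencies are computed with base R rather than `tabyl`.

```r
library(dplyr)

# Hypothetical generalisation (not the answer's function): `target` is the
# value to replace and `vals` are the values it may be replaced with.
my_impute2 <- function(.data, .var, target, vals) {
  x <- pull(.data, {{ .var }})

  # Relative frequency of each donor value, in the same order as `vals`
  probs <- vapply(vals, function(v) mean(x == v, na.rm = TRUE), numeric(1))

  # Draw replacements only for the rows holding `target`. Indexing into
  # `vals` avoids sample()'s special case where a single number n means 1:n.
  hit <- which(x == target)
  draws <- sample.int(length(vals), length(hit), replace = TRUE, prob = probs)
  x[hit] <- vals[draws]

  # !!x inlines the finished vector, so a column named `x` cannot shadow it
  mutate(.data, {{ .var }} := !!x)
}

set.seed(1)  # assumption: seed chosen only to make the sketch repeatable
df3 <- tibble(q1 = sample(c(1, 3, 5), 5000, replace = TRUE,
                          prob = c(0.776, 0.124, 0.100)))

df3 %>%
  my_impute2(q1, target = 5, vals = c(1, 3)) %>%
  count(q1)
```

With this shape the result should contain only the values passed in `vals`, and rows that did not hold `target` are left untouched.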

Just to add my two cents: recent versions of rlang (0.4.0 and later) let you replace the enquo() + !! quasiquotation pattern with the curly-curly operator, {{ }}, which "embraces" a function argument. The function would then look like this:

my_impute <- function(.data, .prob_var, vals, ...) {

  #.prob_var = enquo(.prob_var)
  # commented out since it is no longer needed
  .prob_vals <- .data %>%
    janitor::tabyl({{ .prob_var }}) %>%
    filter({{ .prob_var }} %in% vals) %>%   # `vals` is a plain vector, no {{ }} needed
    pull(percent)

  .data %>%
    mutate({{ .prob_var }} := case_when(
      {{ .prob_var }} == 5 ~ sample(
        2,
        n(),
        prob = .prob_vals,                  # local variable, used as is
        replace = TRUE
      ),
      # use the embraced column rather than a hard-coded q1
      TRUE ~ as.integer({{ .prob_var }})
    ))
}
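
To see the two styles side by side, here is a minimal sketch that wraps `dplyr::count()` both ways and compares the results. The function names `count_enquo` and `count_embrace` and the tibble `toy` are made up for this illustration.

```r
library(dplyr)

# Older pattern: capture the argument with enquo(), then unquote it with !!
count_enquo <- function(.data, .var) {
  .var <- enquo(.var)
  .data %>% count(!!.var)
}

# Curly-curly: capture and unquote in a single step
count_embrace <- function(.data, .var) {
  .data %>% count({{ .var }})
}

toy <- tibble(q1 = c(1, 1, 2, 5, 1))

# Both wrappers should produce the same table of counts
all.equal(count_enquo(toy, q1), count_embrace(toy, q1))
```

Embracing is meant for function arguments that name a column. Ordinary values, such as a vector of probabilities computed inside the function, can be used directly.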