6

I'm trying to create many different possible weighting schemes based on temperature.

I created a data frame with all possible combinations of 8 vectors (each vector represents a temperature range). So the columns of the data frame are a specific temperature range and the rows are weights.

I would like to pass the temperature ranges as arguments to case_when, and loop through each row of the weights data frame, creating a new variable for each row based on the actual temperature and the associated weight for that temperature based on the information in the weights data frame.

Using the following post, I was able to create a function to produce the weights data frame:

Use dplyr::case_when with arguments programmatically

But I don't know how to construct the case_when arguments using the weights data frame.

Function to create data frame of all possible weights

library(rlang)
library(tidyverse)

create_temp_weights <- function(
  from = 31,
  to = 100,
  by = 10,
  weights = exprs(between(., 31, 40) ~ c(0, 0.2),
                  between(., 41, 50) ~ c(0.5, 0.8),
                  between(., 51, 90) ~ c(0.8, 1),
                  between(., 91, 100) ~ c(0.2, 0.8),
                  TRUE ~ c(-0.1, 0))
) {

  # use 999 to map other temperatures to last case
  map(c(seq(from, to, by), 999), ~ case_when(!!!weights)) %>%
    set_names(c(map_chr(seq(from, to, by),
                      ~ str_c("temp_", ., "_", . + by - 1)), "temp_other")) %>%
  cross_df(.)

}

temp_weights <- create_temp_weights()

Create tibble with vector of temperatures used to construct the weights

test_tibble <- tibble(temp = seq_len(100))

head(test_tibble)

The following case_when is the thing I'm trying to produce programmatically using the weights data frame.

# Now I want to create a function that will produce the following
# case_when from the temp_weight data frame so I don't have to
# manually edit the following each time I create a new weights data frame

test_tibble2 <- map_dfc(.x = seq_len(nrow(temp_weights)),
    ~ transmute(
      test_tibble,
      temp =
        case_when(
          temp >= 31   & temp  <= 40   ~  temp_weights$temp_31_40[.x],
          temp >= 41   & temp  <= 50   ~  temp_weights$temp_41_50[.x],
          temp >= 51   & temp  <= 60   ~  temp_weights$temp_51_60[.x],
          temp >= 61   & temp  <= 70   ~  temp_weights$temp_61_70[.x],
          temp >= 71   & temp  <= 80   ~  temp_weights$temp_71_80[.x],
          temp >= 81   & temp  <= 90   ~  temp_weights$temp_81_90[.x],
          temp >= 91   & temp  <= 100  ~  temp_weights$temp_91_100[.x],
          TRUE & !is.na(temp)          ~  temp_weights$temp_other[.x]
        )
      ) %>% set_names(paste0("temp_wt_", .x))
    ) 

head(test_tibble2)

So what I'm looking for is a function that constructs the case_when arguments from a weights data frame.

Giovanni Colitti
  • 1,982
  • 11
  • 24
  • Not sure but maybe `cut` might be a more efficient function for the job? – NelsonGon Jul 31 '19 at 16:24
  • 1
    Yeah, I should have mentioned that efficiency is important because the actual temperature variable from which the weights are constructed is 40 million observations long. So if `cut` can be used here more efficiently I'm happy to switch. But then the question becomes: How can I generate the arguments to `cut` from the weights data frame? – Giovanni Colitti Jul 31 '19 at 16:33
  • 1
    It's a very good question! If you run `debugonce(case_when)` and then your `map_dfc` call you can observer that `case_when` arguments are parsed via `fs <- compact_null(list2(...))`. One potential solution would be to use [`trace`](https://stackoverflow.com/a/2458377/1655567) to replace that **`fs`** object with the values you would programmatically generate using the data frame you have. – Konrad Jul 31 '19 at 16:34
  • So basically, my initial thinking would be to approach this [like that](https://gist.github.com/konradzdeb/199db4d53e87b6aec068bcfbecc6a6f1) this is a very dirty solution that attempts to inject the externally constructed `case_when` object. This shows my thinking but returns an error as you would have to understand that `str` behind the **`fs`** object and then replace it properly; not like in my lame attempt. – Konrad Jul 31 '19 at 16:45
  • @Konrad Thank you for your comments. I'm not sure how the arguments to `case_when` are being generated programmatically from the weights data frame. If I change the number of temperature intervals using the function in the beginning of my post, the number of arguments to `case_when` will need to change as well. I just don't want to have to do this manually. – Giovanni Colitti Jul 31 '19 at 16:55

2 Answers2

1

Closely mimicking OP:

windows <- 
  str_extract_all(names(temp_weights), "\\d+") %>% 
  modify(as.integer) %>% 
  modify_if(negate(length), ~ c(-Inf, Inf)) %>% 
  set_names(names(temp_weights))

temp <- test_tibble$temp

res <-
  map_dfc(
    seq_len(nrow(temp_weights)), 
    ~ {
      row <- .
      rlang::eval_tidy(expr(case_when(
        !!! imap(
          windows, 
          ~ expr(
            between(temp, !! .x[1], !! .x[2]) ~ !! temp_weights[[.y]][row]
          )
        )
      )))
    }
  ) %>% 
  set_names(paste0("temp_wt_", seq_along(.)))

all.equal(res, test_tibble2)
#> [1] TRUE 

Slightly more efficient (not repeating case_when for each weight combination):

res2 <- 
  rlang::eval_tidy(expr(case_when(
    !!! imap(
      windows, 
      ~ expr(
        between(temp, !! .x[1], !! .x[2]) ~ list(!! temp_weights[[.y]])
      )
    )
  ))) %>% 
  do.call(what = rbind) %>% 
  as_tibble() %>% 
  set_names(paste0("temp_wt_", seq_along(.)))

all.equal(res2, test_tibble2)
#> [1] TRUE   
Aurèle
  • 12,545
  • 1
  • 31
  • 49
0

This is meant to supplement the accepted answer by Aurèle.

Here, I compare efficiency between Aurèle's two proposed solutions and a final solution using data.table, which also provides the option to preserve NAs.

suppressPackageStartupMessages(library(rlang))
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(tictoc))

create_temp_weights <- function(
  from = 31,
  to = 100,
  by = 10,
  weights = exprs(between(., 31, 40) ~ c(0, 0.2),
                  between(., 41, 50) ~ c(0.5, 0.8),
                  between(., 51, 90) ~ c(0.8, 1),
                  between(., 91, 100) ~ c(0.2, 0.8),
                  TRUE ~ c(-0.1, 0))
) {

  # use 999 to map other temperatures to last case
  map(c(seq(from, to, by), 999), ~ case_when(!!!weights)) %>%
    set_names(c(map_chr(seq(from, to, by),
                        ~ str_c("temp_", ., "_", . + by - 1)), "temp_other")) %>%
    cross_df(.)

}

temp_weights <- create_temp_weights()

test_tibble <- tibble(temp = rnorm(1000000, 50, 15))

test_tibble2 <- map_dfc(.x = seq_len(nrow(temp_weights)),
                        ~ transmute(
                          test_tibble,
                          temp =
                            case_when(
                              temp >= 31   & temp  <= 40   ~  temp_weights$temp_31_40[.x],
                              temp >= 41   & temp  <= 50   ~  temp_weights$temp_41_50[.x],
                              temp >= 51   & temp  <= 60   ~  temp_weights$temp_51_60[.x],
                              temp >= 61   & temp  <= 70   ~  temp_weights$temp_61_70[.x],
                              temp >= 71   & temp  <= 80   ~  temp_weights$temp_71_80[.x],
                              temp >= 81   & temp  <= 90   ~  temp_weights$temp_81_90[.x],
                              temp >= 91   & temp  <= 100  ~  temp_weights$temp_91_100[.x],
                              TRUE & !is.na(temp)          ~  temp_weights$temp_other[.x]
                            )
                        ) %>% set_names(paste0("temp_wt_", .x))
) 

windows <- 
  str_extract_all(names(temp_weights), "\\d+") %>% 
  modify(as.integer) %>% 
  modify_if(negate(length), ~ c(-Inf, Inf)) %>% 
  set_names(names(temp_weights))

Solution #1

temp <- test_tibble$temp

tic()
res <-
  map_dfc(
    seq_len(nrow(temp_weights)), 
    ~ {
      row <- .
      rlang::eval_tidy(expr(case_when(
        !!! imap(
          windows, 
          ~ expr(
            between(temp, !! .x[1], !! .x[2]) ~ !! temp_weights[[.y]][row]
          )
        )
      )))
    }
  ) %>% 
  set_names(paste0("temp_wt_", seq_along(.)))
toc()
#> 65.18 sec elapsed

all.equal(res, test_tibble2)
#> [1] TRUE

Solution #2

tic()
res2 <- 
  rlang::eval_tidy(expr(case_when(
    !!! imap(
      windows, 
      ~ expr(
        between(temp, !! .x[1], !! .x[2]) ~ list(!! temp_weights[[.y]])
      )
    )
  ))) %>% 
  do.call(what = rbind) %>% 
  as_tibble() %>% 
  set_names(paste0("temp_wt_", seq_along(.)))
#> Warning: `as_tibble.matrix()` requires a matrix with column names or a `.name_repair` argument. Using compatibility `.name_repair`.
#> This warning is displayed once per session.
toc()
#> 2.76 sec elapsed

all.equal(res2, test_tibble2)
#> [1] TRUE

Solution #3 Using data.table

tic()
res3 <-
  rlang::eval_tidy(expr(case_when(
    !!! imap(
      windows,
      ~ expr(
        between(temp, !! .x[1], !! .x[2]) ~ list(!! temp_weights[[.y]])
      )
    )
  ))) %>%
  data.table::transpose(., fill = NA) %>%
  set_names(paste0("temp_wt_", seq_along(.))) %>%
  as_tibble()
toc()
#> 4.69 sec elapsed

all.equal(res3, test_tibble2)
#> [1] TRUE

In summary, solution #2 seems to be the fastest (2.76 sec) followed by the data.table solution (4.69 sec). However, I appreciate that the data.table solution has the fill option to preserve NAs.

Created on 2019-08-02 by the reprex package (v0.3.0)

Giovanni Colitti
  • 1,982
  • 11
  • 24