0

I have as part of a SQL query the following as part of an aggregation that I want to convert to dplyr syntax:

case when sum(amy_jan) != 0 then sum(cost_jan)/sum(amy_jan) else 0 end  as ratio_jan
-- ...
case when sum(amy_dec) != 0 then sum(cost_dec)/sum(amy_dec) else 0 end  as ratio_dec

What would that look like in dplyr syntax?

r2evans
  • 141,215
  • 6
  • 77
  • 149
bill rowe
  • 43
  • 5
  • You can't just convert this to dplyr. SQL isn't executed as. is. The database creates an execution plan from the query that's *very* different, may decide to cache intermediate results, ensure sums are calculated only once etc. You have to think how to make the calculations in an efficient way and avoid eg calculating `sum(amy_dec)` multiple times over 1M rows. It's highly unlikely you'll be able to perform those calculations faster than a database server with dozens of cores and GB of RAM – Panagiotis Kanavos Oct 01 '20 at 18:05
  • Thanks. Appreciate the insight. – bill rowe Oct 01 '20 at 21:10

1 Answers1

0

If all you're looking for is the dplyr equivalent to your SQL-case ... when ... statement, it is dplyr::case_when().
More specifically, on a use-case like the one you seem to be describing (giving a reproducible example of any kind and your expected output would've been helpful here!), something like this would showcase this function:

library(dplyr)
library(purrr)

months <- tolower(month.abb)

# here I'm dummying some data, like what you're describing
df <- months %>%
  map_dfc(
    ~ tibble(
      !!paste0("cost_", .x) := sample(0:10, 100, replace = T),
      !!paste0("amy_", .x) := sample(0:10, 100, replace = T)
    )
  )

# summarise that dataframe, once for every month in our list
df_summary <- df %>%
  summarise(
    ratio_jan = case_when(
      sum(amy_jan) > 0 ~ sum(cost_jan) / sum(amy_jan),
      T ~ 0
    ),
    # ratio_feb = case_when(...),
    # ... and so on, for every month
  )

The above incorporates case_when to reproduce in dplyr what you've shared in SQL flavor.

Below code is an extension, using purrr::map_dfc to iterate over the abbreviated months and col-bind the calculated "ratio_xxx" values. That saves you the hassle of having to hard code 12 variable calculations... :)

df_summary2 <- months %>%
  map_dfc(
    ~ df %>%
      summarise(
        !!paste0("ratio_", .x) := case_when(
          sum(.data[[paste0("amy_", .x)]]) > 0 ~ sum(.data[[paste0("cost_", .x)]]) / sum(.data[[paste0("amy_", .x)]]),
          T ~ 0
        )
      )
  )
alex_jwb90
  • 1,663
  • 1
  • 11
  • 20