A dplyr programming question. I am trying to write a dplyr function that takes column names as inputs and also filters on a value passed to the function. The result I am trying to recreate is the following, called test:

library(dplyr)

# test df
x<- sample(1:100, 10)
y<- sample(c(TRUE, FALSE), 10, replace = TRUE)
date<- seq(as.Date("2018-01-01"), as.Date("2018-01-10"), by =1)


my_df<- data.frame(x = x, y =y, date =date)

test<- my_df %>% group_by(date) %>% 
  summarise(total = n(), total_2 = sum(y ==TRUE, na.rm=TRUE)) %>%
  mutate(cumulative_a = cumsum(total), cumulative_b = cumsum(total_2)) %>%
  ungroup() %>% filter(date >= "2018-01-03")

The function I am testing is as follows:

cumsum_df<- function(data, date_field, cumulative_y, minimum_date = "2017-04-21") {

  date_field <- enquo(date_field)
  cumulative_y <- enquo(cumulative_y)

  data %>% group_by(!!date_field) %>% 
    summarise(total = n(), total_2 = sum(!!cumulative_y ==TRUE, na.rm=TRUE)) %>%
    mutate(cumulative_a = cumsum(total), cumulative_b = cumsum(total_2)) %>%
    ungroup() %>% filter((!!date_field) >= minimum_date)

}

test2<- cumsum_df(data = my_df, date_field = date, cumulative_y = y, minimum_date = "2018-01-03")

I have looked at some examples of using enquo, and this thread gets me halfway there:

Use variable names in functions of dplyr

But the issue is that I get two different data frames for test and test2. The output from the function does not contain the data from the logical column y.
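One possible explanation, consistent with the later comment that updating the packages fixed it, is operator precedence. In base R, `!!cumulative_y == TRUE` parses as `!!(cumulative_y == TRUE)`, and as far as I know older rlang releases (before 0.2.0) unquoted the whole comparison rather than just `cumulative_y`. The sketch below wraps the unquoted column in parentheses, as the function already does in `filter()`, so the result does not depend on that precedence. The fixed seed and the `as.Date()` conversion of `minimum_date` are assumptions added here for reproducibility and explicitness, not part of the original code.

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed so the toy data is reproducible
my_df <- data.frame(
  x    = sample(1:100, 10),
  y    = sample(c(TRUE, FALSE), 10, replace = TRUE),
  date = seq(as.Date("2018-01-01"), as.Date("2018-01-10"), by = 1)
)

cumsum_df <- function(data, date_field, cumulative_y, minimum_date = "2017-04-21") {
  date_field   <- enquo(date_field)
  cumulative_y <- enquo(cumulative_y)

  data %>%
    group_by(!!date_field) %>%
    # Parentheses make the order explicit: unquote the column first, then compare
    summarise(total = n(), total_2 = sum((!!cumulative_y) == TRUE, na.rm = TRUE)) %>%
    mutate(cumulative_a = cumsum(total), cumulative_b = cumsum(total_2)) %>%
    ungroup() %>%
    filter((!!date_field) >= as.Date(minimum_date))
}

# Reference result written without the function
test <- my_df %>%
  group_by(date) %>%
  summarise(total = n(), total_2 = sum(y == TRUE, na.rm = TRUE)) %>%
  mutate(cumulative_a = cumsum(total), cumulative_b = cumsum(total_2)) %>%
  ungroup() %>%
  filter(date >= as.Date("2018-01-03"))

test2 <- cumsum_df(my_df, date_field = date, cumulative_y = y, minimum_date = "2018-01-03")

# Compare the two results
all.equal(test, test2)
```

If the two results still differ with the parentheses in place, the cause is more likely elsewhere in the session, for example in the installed package versions.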

I also tried this instead:

cumsum_df<- function(data, date_field, cumulative_y, minimum_date = "2017-04-21") {

  date_field <- enquo(date_field)
  cumulative_y <- deparse(substitute(cumulative_y))

  data %>% group_by(!!date_field) %>% 
    summarise(total = n(), total_2 = sum(data[[cumulative_y]] ==TRUE, na.rm=TRUE)) %>%
    mutate(cumulative_a = cumsum(total), cumulative_b = cumsum(total_2)) %>%
    ungroup() %>% filter((!!date_field) >= minimum_date)

}

test2<- cumsum_df(data= my_df, date_field = date, cumulative_y = y, minimum_date = "2018-01-04")

Based on this thread: Pass a data.frame column name to a function

But the total_2 column in the output of test2 is also wildly different, and it seems to do some kind of recursive accumulation, which again differs from my test data frame.
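A likely reason for the accumulation in this second attempt: inside `summarise()`, `data[[cumulative_y]]` refers to the whole, ungrouped column of the original data frame, so every group receives the same overall count and `cumsum()` then stacks that number up row by row. The `.data` pronoun looks the column up within the current group instead. The sketch below uses a hypothetical toy data frame (`toy_df`, two rows per date) to make the difference visible.

```r
library(dplyr)

# Hypothetical toy data: two rows per date, so grouping actually matters
toy_df <- data.frame(
  date = as.Date("2018-01-01") + c(0, 0, 1, 1, 2, 2),
  y    = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE)
)

col_name <- "y"  # the column name as a string, as deparse(substitute()) would give

# toy_df[[col_name]] is the full column, so every group gets the overall total
whole_column <- toy_df %>%
  group_by(date) %>%
  summarise(total_2 = sum(toy_df[[col_name]], na.rm = TRUE))

# .data[[col_name]] is the column restricted to the current group
per_group <- toy_df %>%
  group_by(date) %>%
  summarise(total_2 = sum(.data[[col_name]], na.rm = TRUE))

whole_column$total_2  # 3 3 3
per_group$total_2     # 1 2 0
```

Replacing `data[[cumulative_y]]` with `.data[[cumulative_y]]` in the function above should therefore give per-date counts, assuming a dplyr version that supports the `.data` pronoun.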

If anyone can help that would be much appreciated.

MrMonkeyBum
  • Can you check the example? The 'date' column has all unique values – akrun Jun 08 '18 at 15:37
  • I don't get any difference between the outputs: `identical(test, test2)# [1] TRUE`. BTW, you don't need to compare a logical vector with TRUE, i.e. `sum(cumulative_y, na.rm = TRUE)` is enough – akrun Jun 08 '18 at 15:41
  • Ah many thanks, I'm getting the same answer too now at home, but was getting a completely different data.frame output on my terminal at work. Going to try again when I'm back. – MrMonkeyBum Jun 08 '18 at 20:06
  • Maybe both `plyr` and `dplyr` were loaded in one of the sessions? – akrun Jun 09 '18 at 03:07
  • @akrun I have tried with both plyr and dplyr loaded on my work terminal and I am getting different answers again. It's very strange. I have had a colleague run the code and he gets the same answer, so there must be something wrong with the internal workings of my R session, but I have no idea what it could be – MrMonkeyBum Jun 11 '18 at 08:32
  • OK I have solved the issue by updating the Tidyverse package. It looks like an older version of one of the dplyr dependencies was causing the issue, possibly rlang. Can mark as resolved. – MrMonkeyBum Jun 11 '18 at 09:07
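Since the fix turned out to be a package update, a quick way to compare two machines is to print the installed versions on each. This is only a minimal sketch; the choice of packages to check (`dplyr`, `rlang`) follows the guess in the comment above and is an assumption.

```r
# Installed versions of the packages suspected in the comments
pkgs <- c("dplyr", "rlang")
versions <- vapply(pkgs, function(p) as.character(packageVersion(p)), character(1))
print(versions)

# Packages currently attached to the search path, in masking order;
# useful for spotting plyr loaded alongside dplyr
print(search())
```

Running the same lines in both sessions and comparing the output shows whether the environments differ.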
