
I am trying to do a conditional summation based on a table that looks like this:

[screenshot of the source table, with columns pid, program, location, locationid, date, and value]

I am trying to do a summation of the column "value" and group by Location. Normally I would just do this:

data <- file %>%
  group_by(location, date) %>%
  summarize(value = sum(value))

However, for the location "Central" only, I would like to exclude the program "B". So I tried it this way, but it did not work:

data <- file %>%
  group_by(location, date) %>%
  summarize(value = 
    case_when(location == "Central" ~ filter(program != "B")),
              TRUE ~ sum(value)
  )

If someone could please help me with the code above, I would much appreciate it. Thank you :)

EDIT: Here is the reproducible data using dput:

structure(list(pid = c(123, 123, 123, 123, 123, 123, 123, 
123, 123, 123), program = c("A", "A", 
"A", "A", "A", 
"A", "A", "A", 
"A", "A"), location = c("Central", 
"Central", "Central", "Central", "Central", "Central", "Central", 
"Central", "Central", "Central"), locationid = c("123-Central", 
"123-Central", "123-Central", "123-Central", "123-Central", 
"123-Central", "123-Central", "123-Central", "123-Central", 
"123-Central"), date = structure(c(1302480000, 1305072000, 1307750400, 
1310342400, 1313020800, 1315699200, 1318291200, 1323561600, 1326240000, 
1328918400), tzone = "UTC", class = c("POSIXct", "POSIXt")), 
    value = c(37207.43, -56936.95, -52871, 6980.05, 10703.16, 
    4006.1, 6505.3, 9661.29, 6897.26, 7212.87)), row.names = c(NA, 
-10L), class = c("tbl_df", "tbl", "data.frame"))
  • `filter` works on a whole data frame. You can filter first: `file |> filter(!(program == "B" & location == "Central")) |> group_by(...) |> summarize(...)`. Or you can use vector subsetting functions like `[` inside the `sum`, like this: `case_when(location == "Central" ~ sum(value[program != "B"]), TRUE ~ sum(value))`. But you can't use `filter` on a vector/column, nor can you use it as a result, as in `location == "Central" ~ filter(program != "B")`, when you want the result to be a sum. – Gregor Thomas Jun 05 '23 at 19:11
  • `file %>% filter(location != "Central" | program != "B") %>% group_by(location, date) %>% summarise(value = sum(value))` ? – Axeman Jun 05 '23 at 19:14

1 Answer


`filter` works on a whole data frame. You can filter first:

file |>
  filter(!(program == "B" & location == "Central")) |>
  group_by(location, date) |>
  summarize(value = sum(value))

Or you can use vector subsetting functions like `[` inside the `sum()` call, like this:

data <- file |>
  group_by(location, date) |>
  summarize(value = 
    case_when(
      location == "Central" ~ sum(value[program != "B"]),
      TRUE ~ sum(value)
    )
  )

But you can't call `filter()` on a vector/column, nor can you use it as a result, as in `location == "Central" ~ filter(program != "B")`, when you want the result to be a sum.
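
For reference, here is a quick check of the filter-first version on the sample shared in the question (a minimal sketch, assuming the dput() output above has been assigned to an object named `file`):

library(dplyr)

# `file` is assumed to hold the tibble from the dput() output in the question
data <- file |>
  filter(!(program == "B" & location == "Central")) |>
  group_by(location, date) |>
  summarize(value = sum(value), .groups = "drop")

# Expect one row per location/date pair: 10 rows for the 10-row sample,
# since every row has program "A" and a distinct date, so nothing is dropped
nrow(data)
n_distinct(data$location, data$date)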

  • thank you @Gregor Thomas. I am trying to count things normally for all locations, except for Central I want to count A and C only and exclude "B". I used your 2nd code above (`location == "Central" ~ sum(value[program != "B"])`). It seems to have worked, but it gave me a bunch of duplicate rows. I was wondering if you knew why? Is there a way to do this without creating duplicates? I can also just dedupe later on, but I was wondering if you have better suggestions as to why the duplicate rows were created? – On_demand Jun 05 '23 at 19:27
  • `summarize` should always return 1 row per group - which means 1 row per unique location/date combination. If you see duplicates, then there is an issue. Perhaps you loaded `plyr` after `dplyr` (or some other package did), missed the printed warning message, and you are inadvertently using `plyr::summarize` instead of `dplyr::summarize`? [See this FAQ for details](https://stackoverflow.com/q/26106146/903061). If that's not it, all I can do is suggest you make a reproducible example and share your input as copy/pasteable data with `dput()`, not a screenshot. We can't test code on screenshots. – Gregor Thomas Jun 05 '23 at 19:36
  • here is the message I get when I use dplyr::summarize -> "Returning more (or less) than 1 row per `summarise()` group was deprecated in dplyr 1.1.0. ℹ Please use `reframe()` instead. ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()` always returns an ungrouped data frame and adjust accordingly. Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated. " I replaced summarize with reframe() and the warning went away but the duplicate rows are still there. – On_demand Jun 05 '23 at 19:38
  • All I can do is suggest you make a reproducible example and share your input as copy/pasteable data with `dput()`, not a screenshot. We can't test code on screenshots. For example, `dput(file[1:10, ])` for the first 10 rows. Please make sure you share a subset of the data that demonstrates the problem. – Gregor Thomas Jun 05 '23 at 19:45
  • apologies, but I am not really sure how to use dput; I will look up how to use this function online and get back to you. For now, I did this and it seemed to solve the issue -> `location == "Central" ~ sum(value[program != "B"]), TRUE ~ sum(value) ) ) %>% distinct()` – On_demand Jun 05 '23 at 19:51
  • I've shown you how to use `dput`, your data is named `file` (in your question) so`dput(file[1:10, ])` will make a copy/pasteable version of the first 10 rows of `file` - just copy/paste the `dput()` output into your question. – Gregor Thomas Jun 05 '23 at 20:28
  • It looks like your dates are actually date-times, I'd suggest converting that with `as.Date()`, but in your sample data I cannot replicate your error, nor do I get any duplicates in the result. – Gregor Thomas Jun 05 '23 at 21:19
  • I am using the fpp3 package to do forecasting, which is why I converted the date using yearmonth, but it is strange that I am still getting duplicates in my data – On_demand Jun 06 '23 at 17:14
  • If you can find a sample of data to share that demonstrates the duplicate problem, we can work on it. But if you run the code in my answer on the `dput()` shared in your question, I think you will find that there are no duplicates. – Gregor Thomas Jun 06 '23 at 17:16
  • I ran the code in your second answer (the one that I accepted as the solution) and it ran beautifully; the problem, for some reason, is that it created 2 duplicate rows per row. For example, there were 3 rows for April 2021, so I just removed the 2 duplicates using distinct() – On_demand Jun 06 '23 at 17:42
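
If the duplicates only appear on the full data and not on the dput() sample, one possible explanation (an assumption; it cannot be verified from the sample alone) is that the full data has several rows per location/date group, for example after collapsing dates with yearmonth(). Inside summarize(), the grouping column location then has the length of the group, case_when() recycles its result to that length, and summarize() emits that many rows per group, which is what the quoted deprecation warning is complaining about. A minimal sketch that keeps the condition inside the subscript, so the result is always length 1 per group and no distinct() step is needed:

library(dplyr)

data <- file |>
  group_by(location, date) |>
  summarize(
    # drop program "B" only where location is "Central";
    # sum() collapses the group-length subscript to a single value
    value = sum(value[location != "Central" | program != "B"]),
    .groups = "drop"
  )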