How to use dplyr to group_by multiple variables and sum other variables

Question

I have a dataframe combined_data that looks like this (this is just an example):

Year    state_name       VoS_thousUSD     industry
2008    Alabama          100              Shipping
2009    Alabama          100              Shipping
2008    Alabama          200              Shipping
2010    Alabama          100              Shipping
2010    Alabama          50               Shipping
2010    Alabama          100              Shipping
2008    Alabama          100              Shipping

There are multiple Year, state_name, and industry values, each with an associated VoS_thousUSD value, as well as other columns I no longer need.

I am trying to produce this

Year    state_name       VoS_thousUSD     industry
2008    Alabama          400              Shipping
2009    Alabama          100              Shipping
2010    Alabama          250              Shipping

Here the dataframe is grouped by Year, state_name, and industry, and VoS_thousUSD is the sum within each group.

So far I have

combined_data %>%
  group_by(Year, state_name, GCAM_industry) %>% 
  summarise() -> VoS_thousUSD_state_ind

But I am not sure how or where to add the sum for VoS_thousUSD. I would like to use a dplyr pipeline.

Change `summarise()` to `summarise(VoS_thousUSD = sum(VoS_thousUSD))` — Gregor Thomas, Jun 08 '20 at 18:13
I mean, I don't always dupe-close, but for the top R-FAQs I pretty much always do when I see one. When my page reloaded after closing, your answer loaded with it. Isn't comment and close much more efficient than writing the same answer again and again? — Gregor Thomas, Jun 08 '20 at 18:24
I'm sure I've missed many - sum by group, mean by group, sorting, merging, I'm sure the top FAQs have 1000s of unmarked dupes. But when I see it and am in it, I try to close it. — Gregor Thomas, Jun 08 '20 at 18:26

akrun · Accepted Answer · 2020-06-08T18:20:19.860

We can use

aggregate(VoS_thousUSD ~ ., combined_data, FUN = sum)
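
Note that the `.` on the right-hand side of the formula expands to every other column in the data, so if `combined_data` still contains the unneeded columns mentioned in the question, they would all become grouping variables. A minimal sketch that names the three grouping columns explicitly instead (the `notes` column is hypothetical and only stands in for an unneeded extra column):

```r
# Example data from the question plus one extra column.
# `notes` is a made-up column, used only to represent "other columns".
combined_data <- data.frame(
  Year         = c(2008L, 2009L, 2008L, 2010L, 2010L, 2010L, 2008L),
  state_name   = "Alabama",
  VoS_thousUSD = c(100L, 100L, 200L, 100L, 50L, 100L, 100L),
  industry     = "Shipping",
  notes        = letters[1:7],
  stringsAsFactors = FALSE
)

# List the grouping columns by name so `notes` is ignored
# rather than being treated as a fourth grouping variable.
result <- aggregate(VoS_thousUSD ~ Year + state_name + industry,
                    data = combined_data, FUN = sum)
print(result)
```

With `VoS_thousUSD ~ .` on this data, `notes` would split every row into its own group and nothing would be summed.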

Or with dplyr

library(dplyr)
combined_data %>%
   group_by(Year, state_name, industry) %>% 
   summarise(VoS_thousUSD = sum(VoS_thousUSD))
# A tibble: 3 x 4
# Groups:   Year, state_name [3]
#   Year state_name industry VoS_thousUSD
#  <int> <chr>      <chr>           <int>
#1  2008 Alabama    Shipping          400
#2  2009 Alabama    Shipping          100
#3  2010 Alabama    Shipping          250
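
The result above is still grouped by Year and state_name, as the `# Groups:` line shows. As a sketch, assuming dplyr 1.0.0 or later, passing `.groups = "drop"` to `summarise()` returns an ungrouped tibble, and the result can be assigned to `VoS_thousUSD_state_ind` as in the question. The data is repeated here so the block runs on its own:

```r
library(dplyr)

# Same example data as in the question
combined_data <- data.frame(
  Year         = c(2008L, 2009L, 2008L, 2010L, 2010L, 2010L, 2008L),
  state_name   = "Alabama",
  VoS_thousUSD = c(100L, 100L, 200L, 100L, 50L, 100L, 100L),
  industry     = "Shipping",
  stringsAsFactors = FALSE
)

# Sum VoS_thousUSD within each Year / state_name / industry group,
# then drop all grouping so later steps work on a plain tibble.
VoS_thousUSD_state_ind <- combined_data %>%
  group_by(Year, state_name, industry) %>%
  summarise(VoS_thousUSD = sum(VoS_thousUSD), .groups = "drop")

print(VoS_thousUSD_state_ind)
```

Columns that are neither grouped nor summarised are dropped by `summarise()`, so the unneeded columns do not have to be removed beforehand.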

data

combined_data <- structure(list(Year = c(2008L, 2009L, 2008L, 2010L, 2010L, 2010L, 
2008L), state_name = c("Alabama", "Alabama", "Alabama", "Alabama", 
"Alabama", "Alabama", "Alabama"), VoS_thousUSD = c(100L, 100L, 
200L, 100L, 50L, 100L, 100L), industry = c("Shipping", "Shipping", 
"Shipping", "Shipping", "Shipping", "Shipping", "Shipping")),
class = "data.frame", row.names = c(NA, 
-7L))

I have been asked not to use the aggregate function, and to use a tidyverse pipeline instead. Still possible? — Maridee Weber, Jun 08 '20 at 18:14
