1

My dataframe has 2 categorical variables, and one is of a lower hierarchy than the other. I want to sum the numerical value of all rows within the sub-category using dplyr.

Thank you in advance for all those who can help me with!

This is the dataframe I start with:

transportation <- data.frame(
  Country = c("A", "A", "A", "B", "B", "B"),
  Mode = c("Car", "Train", "Plane", "Car", "Train", "Plane"),
  Energy = c(10000, 9000, 20000, 200000, 160000, 450000)
)

Starting Dataframe

And this is the dataframe I want to end up with:

country_sum <- data.frame(
  Country = c("A", "A", "A", "B", "B", "B"),
  Mode = c("Car", "Train", "Plane", "Car", "Train", "Plane"),
  Energy = c(10000, 9000, 20000, 200000, 160000, 450000),
  country_sum = c(39000, 39000, 39000, 810000, 810000, 810000)
)

Final Dataframe

Nguyen Tran
  • 89
  • 1
  • 11
  • 1
    Why not? Adding a column with a grouped sum to the original data seems an exact duplicate? What do you see as the difference? – Gregor Thomas Apr 19 '22 at 17:58
  • 1
    The top answer at my duplicate has code identical to your answer, except for `mean` instead of `sum` (and of course the appropriate data frame and column names changed). – Gregor Thomas Apr 19 '22 at 18:00
  • 2
    No offense taken, I just really didn't understand why you **wouldn't** consider it a duplicate. I appreciate your clarification. I actually like that FAQ a lot because the title says "mean, sum, or other summary stat", and several of the answers demonstrate the same method with different summary functions. I think it's good at teaching the general concept, and if someone asked "how do I find the maximum value by group and add it to the original data" I think it's nicely educational to say "See this FAQ, and just swap out `max()` for `mean()`... – Gregor Thomas Apr 19 '22 at 18:30
  • Ok. Thank you very much for your kind explanation. I will consider this! – TarJae Apr 19 '22 at 18:31
  • 1
    By contrast, I dislike the "summary" versions of this FAQ where we have both a [mean by group FAQ](https://stackoverflow.com/q/11562656/903061) and a [sum by group FAQ](https://stackoverflow.com/q/1660124/903061). They have 9 and 18 answers respectively, and except for a few special cases answers from one could be used for the other, simply replacing the function. The top answer at "mean by group" is the best IMHO, but sum by group FAQ has many more upvotes. I think consolidating them would have been good, focusing on a single high quality source of answers, but it's probably too late now. – Gregor Thomas Apr 19 '22 at 18:32

2 Answers2

3

First group by Country and then mutate with sum:

library(dplyr)

transportation %>% 
  group_by(Country) %>% 
  mutate(country_sum = sum(Energy))

 Country Mode  Energy country_sum
  <chr>   <chr>  <dbl>       <dbl>
1 A       Car    10000       39000
2 A       Train   9000       39000
3 A       Plane  20000       39000
4 B       Car   200000      810000
5 B       Train 160000      810000
6 B       Plane 450000      810000
TarJae
  • 72,363
  • 6
  • 19
  • 66
2

Using ave.

dplyr::mutate(transportation, c_sum=ave(Energy, Country, FUN=sum))
#   Country  Mode Energy  c_sum
# 1       A   Car  10000  39000
# 2       A Train   9000  39000
# 3       A Plane  20000  39000
# 4       B   Car 200000 810000
# 5       B Train 160000 810000
# 6       B Plane 450000 810000
jay.sf
  • 60,139
  • 8
  • 53
  • 110