5

I have a dataset in R of student weekly allowances by class, which looks like:

Year    ID  Class       Allowance
2013    123 Freshman    100
2013    234 Freshman    110
2013    345 Sophomore   150
2013    456 Sophomore   200
2013    567 Junior      250
2014    678 Junior      100
2014    789 Junior      230
2014    890 Freshman    110
2014    891 Freshman    250
2014    892 Sophomore   220

How can I summarize the results by group (Year/Class) to get sum and % (by group)? Getting sum seems easy with ddply by just couldn't get the % by group part right.

It works for sum:

summary <- ddply(my_data, .(Year, Class), summarize, Sum_Allow=sum(Allowance))

But it doesn't work for the percentage by group part:

summary <- ddply(my_data, .(Year, Class), summarize, Sum_Allow=sum(Allowance),
                 Allow_Pct=Allowance/sum(Allowance))

Ideal result should look like:

 Year     Class Sum_Allow Allow_Pct
 2013  Freshman       210       26%
 2013    Junior       250       31%
 2013 Sophomore       350       43%
 2014  Freshman       360       40%
 2014    Junior       330       36%
 2014 Sophomore       220       24%

I tried ddply from the plyr package, but please let me know of any way that this may work.

jbaums
  • 27,115
  • 5
  • 79
  • 119
user1839897
  • 425
  • 1
  • 10
  • 14
  • you're trying to bind data frames with an uneven number of rows. do `Sum_allow` and `allow_pct` individually and look at the results. You could use head or tail to get the first or last row for each percent – rawr Oct 28 '14 at 20:47

3 Answers3

7

Here's a possible solution using data.table package (assuming your data called df)

library(data.table)
setDT(df)[, list(Sum_Allow = sum(Allowance)), keyby = list(Year, Class)][, 
            Allow_Pct := paste0(round(Sum_Allow/sum(Sum_Allow), 2)*100, "%"), by = Year][]

#    Year     Class Sum_Allow Allow_Pct
# 1: 2013  Freshman       210       26%
# 2: 2013    Junior       250       31%
# 3: 2013 Sophomore       350       43%
# 4: 2014  Freshman       360       40%
# 5: 2014    Junior       330       36%
# 6: 2014 Sophomore       220       24%

Contributed to @rawr, here's a possible base R solution

df2 <- aggregate(Allowance ~  Class + Year, df, sum)
transform(df2, Allow_pct = ave(Allowance, Year, FUN = function(x) paste0(round(x/sum(x), 2)*100, "%")))
#       Class Year Allowance Allow_pct
# 1  Freshman 2013       210       26%
# 2    Junior 2013       250       31%
# 3 Sophomore 2013       350       43%
# 4  Freshman 2014       360       40%
# 5    Junior 2014       330       36%
# 6 Sophomore 2014       220       24%
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
4

So assuming what you want is:

  1. get the sum of the Allowance column in all groups defined by Year AND Class, and
  2. take that sum and divide it by the overall sum in the relevant year

Then this could do the trick in dplyr:

library(dplyr)
my_data <- read.table(header = TRUE,
                      text = 
'Year    ID  Class       Allowance
2013    123 Freshman    100
2013    234 Freshman    110
2013    345 Sophomore   150
2013    456 Sophomore   200
2013    567 Junior      250
2014    678 Junior      100
2014    789 Junior      230
2014    890 Freshman    110
2014    891 Freshman    250
2014    892 Sophomore   220')

summary <- my_data %>%
  group_by(Year) %>%
  summarise(Year_Sum_Allow = sum(Allowance)) %>%
  left_join(x = my_data, y = ., by = 'Year') %>%
  group_by(Year, Class) %>%
  summarise(Sum_Allow = sum(Allowance),
            Allow_Pct = Sum_Allow/first(Year_Sum_Allow))

summary

# Results
Source: local data frame [6 x 4]
Groups: Year

  Year     Class Sum_Allow Allow_Pct
1 2013  Freshman       210 0.2592593
2 2013    Junior       250 0.3086420
3 2013 Sophomore       350 0.4320988
4 2014  Freshman       360 0.3956044
5 2014    Junior       330 0.3626374
6 2014 Sophomore       220 0.2417582

If you're not familiar with dplyr the syntax might seem strange. I suggest taking a look at the introduction. It's a big time-saver.

Edit: I should have added that, if you want the pretty rounded percent formatting in your example output, you can instead substitute Allow_Pct = paste0(round(Sum_Allow/first(Year_Sum_Allow), 2), '%') in the last line.

Edit 2: As jbaums points out, this could be simplified to:

my_data %>% 
  group_by(Year, Class) %>% 
  summarise(sum_allow=sum(Allowance)) %>% 
  mutate(pct_allow=sum_allow/sum(sum_allow))
user1231088
  • 337
  • 3
  • 8
  • 5
    This could be simplified to: `my_data %>% group_by(Year, Class) %>% summarise(sum_allow=sum(Allowance)) %>% mutate(pct_allow=sum_allow/sum(sum_allow))`. – jbaums Oct 28 '14 at 21:07
  • @jbaums, I wonder how the last `mutate` knows that it supposed to be grouped specifically by `year` instead of by the specified `group_by` specified above or not to be grouped at all... – David Arenburg Oct 28 '14 at 21:41
  • 2
    @DavidArenburg, From the [dplyr vignette](http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html): "When you group by multiple variables, each summary peels off one level of the grouping. That makes it easy to progressively roll-up a dataset". Thus, after the `summarise`, the grouping variable 'Class' is peeled off, and the data is then grouped 'only' by 'Year'. – Henrik Oct 28 '14 at 21:46
  • @Henrik - was just about to cite that myself, as well as [your other post](http://stackoverflow.com/a/24576703/489704). @David - because of this, it wouldn't work the same way if I'd written `group_by(Class, Year)`. – jbaums Oct 28 '14 at 21:47
  • @Henrik, that is interesting, maybe worth suggesting this to Arun or Matt as a feature request in `data.table` – David Arenburg Oct 28 '14 at 21:49
  • @jbaums, you probably should have posted this as your own solution, that would be a winner for the simpleast code, instead of being burried here under all this wording – David Arenburg Oct 28 '14 at 21:52
  • @David - no biggie. There's no shortage of methods here that do the trick, anyway. :) – jbaums Oct 28 '14 at 21:56
4

You could do this in two steps

my_data <- read.table(header = TRUE,
                      text = "Year    ID  Class       Allowance
2013    123 Freshman    100
2013    234 Freshman    110
2013    345 Sophomore   150
2013    456 Sophomore   200
2013    567 Junior      250
2014    678 Junior      100
2014    789 Junior      230
2014    890 Freshman    110
2014    891 Freshman    250
2014    892 Sophomore   220")

library(plyr)
(summ <- ddply(my_data, .(Year, Class), summarize, Sum_Allow=sum(Allowance)))

#   Year     Class Sum_Allow
# 1 2013  Freshman       210
# 2 2013    Junior       250
# 3 2013 Sophomore       350
# 4 2014  Freshman       360
# 5 2014    Junior       330
# 6 2014 Sophomore       220

ddply(summ, .(Year), mutate, Allow_pct = Sum_Allow / sum(Sum_Allow) * 100)

#   Year     Class Sum_Allow Allow_pct
# 1 2013  Freshman       210  25.92593
# 2 2013    Junior       250  30.86420
# 3 2013 Sophomore       350  43.20988
# 4 2014  Freshman       360  39.56044
# 5 2014    Junior       330  36.26374
# 6 2014 Sophomore       220  24.17582

I don't know if it happens for the rest of you, but when I run the original attempt, R crashes rather than throwing a warning. Or if I misspell Allow instead of allow, it crashes. I really hate that; hadley pls fix

base r forever

rawr
  • 20,481
  • 4
  • 44
  • 78
  • Not sure what role plays *base R* in your solution – David Arenburg Oct 28 '14 at 21:10
  • @DavidArenburg the frustration of plyr as described above – rawr Oct 28 '14 at 21:16
  • Oh, I didn't get any error with your code, but `plyr` is basically a wrapper for base R, I don't think there is something written in C there – David Arenburg Oct 28 '14 at 21:18
  • @DavidArenburg when I run this code written by the OP in rstudio, it crashes every time, `ddply(my_data, .(Year, Class), summarize, Sum_Allow = sum(Allowance), Allow_Pct = Allowance/sum(Allowance))` because Sum_allow and Allow_pct return differing number of rows like I mentioned in the comments above. No warning about binding issues, just crash – rawr Oct 28 '14 at 21:23
  • 2
    Ok, I contributed a base R solution especially for you, see my edit on my answer – David Arenburg Oct 28 '14 at 21:25