
I am working with a database of all transit systems in America and trying to compare different agencies. Each case is a specific part of an organization; for example, an agency's bus lines are separate from its subways. I would like to combine the values of all the cases for a given agency.

Basically, I want to sum the values in each column for each "Trs_Id", collapsing the data down to one row per agency. This data frame is the breakdown of operating expenses ("opex"). Here is what my dataset looks like in R:

> colnames(opex)
[1] "Trs_Id"                  "Mode_Cd"                
[3] "Service_Cd"              "Expense_Category_Desc"  
[5] "Op_Sal_Wage_Amt"         "Other_Sal_Wage_Amt"     
[7] "Fringe_Benefit_Amt"      "Service_Costs_Amt"      
[9] "Fuel_Lubricant_Amt"      "Tire_Tube_Amt"          
[11] "Other_Mat_Sup_Amt"       "Utility_Amt"            
[13] "Casuality_Liability_Amt" "Tax_Amt"                
[15] "In_Report_Amt"           "Sep_Report_Amt"         
[17] "Misc_Expense_Amt"        "Expense_Transfer_Amt"   
[19] "Ada_Related_Amt"        
> NROW(opex)
[1] 6956
> id_nm <- table(opex$Trs_Id)
> NROW(id_nm)
[1] 616
Andrew Brēza
  • Welcome to StackOverflow. Perhaps if you made a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) that demonstrates your question / problem, people would find it easier to answer. – Andrie Aug 06 '12 at 15:46
  • To the person who silently downvoted: This is the [summer of love](http://blog.stackoverflow.com/2012/07/kicking-off-the-summer-of-love/), so I suggest you do one of a few things: 1) explain why the downvote, 2) explain to the OP how to improve the question, or 3) edit the question so it is a good question. – Andrie Aug 06 '12 at 15:46
  • Not my downvote, but that's not what the Summer of Love is about, @Andrie. – Michael Petrotta Aug 06 '12 at 15:48
  • @MichaelPetrotta The summer of love is about trying to keep SO a friendly place. In my view it's unfriendly to downvote a newbie without first trying to point them in the right direction. – Andrie Aug 06 '12 at 16:01
  • @Andrie I downvoted because it's a poor-quality question. The OP didn't put much effort into finding the answer, and didn't even bother getting the data into R. Why should the answerer be expected to create sample data for the OP? In light of the lack of effort from the OP, I don't see why I should have to put effort into explaining that it's a bad question -- that's what downvotes are _for_. (Of course, if he or anyone else asked, I would/did give my reasoning.) – GSee Aug 06 '12 at 16:09
  • @GSee I agree with everything you say about it being a poorly researched question. And yes, that is what downvotes are for. I also happily downvote a question if it is clear that a person doesn't respond to prompting to make it better. I am simply trying to get all of us to collectively nudge newbies in the right direction, by giving a bit of help first, then downvoting later if the nudging doesn't help. – Andrie Aug 06 '12 at 16:12
  • I just changed the example to R code, thanks for the feedback. – Andrew Brēza Aug 06 '12 at 18:07
  • Andrew, while that gives us some information about your data, if you read the link in Andrie's first comment you will see how much better it could be. Good luck. – Seth Aug 06 '12 at 18:25

2 Answers


Even though you didn't make a reproducible example that demonstrates your problem, I think you can solve it with the ddply function from the plyr package; you could also use the base function aggregate, but I prefer to do it with ddply.

Here are some random numbers to simulate your data.frame:

 #install.packages('plyr')
 library(plyr)

 set.seed(007) # so the example is reproducible
 Mode_Cd1 <- replicate(4, sample(LETTERS[1:26], 1, replace=T)) # random letters used to build Mode_Cd and Service_Cd
 Mode_Cd2 <- replicate(4, sample(LETTERS[1:26], 1, replace=T))

 data_set <- data.frame(Trs_Id = rep(paste('00', 1:4, sep=''), each=3),
            Mode_Cd = sample(paste(Mode_Cd1, Mode_Cd2, sep=''), 12, replace=T),
            Service_Cd =  sample(paste(Mode_Cd2, Mode_Cd1, sep=''), 12, replace=T),
            Op_Sal_Wage_Amt = rnorm(12,5000,100),
            Other_Sal_Wage = rnorm(12,3000,800))

 data_set # this is what my random data_set looks like
   Trs_Id Mode_Cd Service_Cd Op_Sal_Wage_Amt Other_Sal_Wage
1     001      ZG         ID        4910.620       2213.558
2     001      KU         UK        4969.267       2779.149
3     001      ZG         ZB        4999.518       2303.319
4     002      ZG         ZB        5098.816       3574.968
5     002      BZ         ZB        5083.975       3088.522
6     002      ZG         GZ        5070.534       2937.227
7     003      KU         ID        5130.596       2663.608
8     003      ZG         UK        4861.200       2550.299
9     003      DI         ZB        5127.292       3798.011
10    004      ZG         UK        5018.419       2115.896
11    004      BZ         ID        5075.228       2886.170
12    004      KU         UK        5059.175       3251.996

 ddply(data_set, .(Trs_Id), numcolwise(sum)) # the sums you want: every numeric column summed within each Trs_Id
  Trs_Id Op_Sal_Wage_Amt Other_Sal_Wage
1    001        14879.40       7296.026
2    002        15253.33       9600.717
3    003        15119.09       9011.918
4    004        15152.82       8254.062
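
The same pattern should carry over to the question's opex data frame, since numcolwise(sum) sums every numeric column within each group. A minimal sketch, assuming the *_Amt columns are numeric and Trs_Id is a character/factor ID as in the example above:

 # hypothetical call on the original data -- untested, since opex itself wasn't posted
 opex_by_agency <- ddply(opex, .(Trs_Id), numcolwise(sum))
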
Jilber Urbina

Using Jilber's data set, you can also handle this with aggregate():

 aggregate(data_set[, 4:5], list(data_set$Trs_Id), sum)  # sum the numeric columns for each Trs_Id

For your example, the column numbers would be 5:19 instead of 4:5.
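
On the original data this would presumably look something like the sketch below (untested, since opex itself wasn't posted); naming the list element gives the grouping column a readable header:

 # hypothetical call for the original opex data frame (columns 5:19 hold the *_Amt fields)
 opex_by_agency <- aggregate(opex[, 5:19], list(Trs_Id = opex$Trs_Id), sum)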

dcarlson