0

Possible Duplicate:
faster way to create variable that aggregates a column by id

I am having trouble with a project. I created a dataframe (called dat) in long format (i copied the first 3 rows below) and I want to calculate for example the mean of the Pretax Income of all Banks in the United States for the years 2000 to 2011. How would I do that? I have hardly any experience in R. I am sorry if the answer is too obvious, but I couldn't find anything and i already spent a lot of time on the project. Thank you in advance!

KeyItem     Bank    Country Year    Value  
1   Pretax Income   WELLS_FARGO_&_COMPANY   UNITED STATES   2011    2.365600e+10                            
2   Total Assets    WELLS_FARGO_&_COMPANY   UNITED STATES   2011    1.313867e+12                            
3   Total Liabilities   WELLS_FARGO_&_COMPANY   UNITED STATES 2011  1.172180e+12                            
Community
  • 1
  • 1
MarMarko
  • 35
  • 2
  • 3
  • 9
  • 4
    Welcome to SO, this question has been asked many times before here. Try http://stackoverflow.com/questions/8225621/faster-way-to-create-variable-that-aggregates-a-column-by-id for example – mnel Nov 02 '12 at 11:29
  • Welcome to SO. @mnel is correct--questions about aggregation have been asked many times here. For your problem, you have to both aggregate and subset your data. You can either aggregate first and subset later, or, if your dataset is very large, subset first and then aggregate (which is what I demonstrated in my answer, but not so much to take the joy of learning through experimentation away from you). Also, as your mandatory reading, most users here generally respond more quickly to [reproducible examples](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – A5C1D2H2I1M1N2O1R2T1 Nov 02 '12 at 12:28

1 Answers1

1

The following should get you started. You basically need to do two things: subset, and aggregate. I'll demonstrate a base R solution and a data.table solution.

First, some sample data.

set.seed(1) # So you can reproduce my results
dat <- data.frame(KeyItem = rep(c("Pretax", "TotalAssets", "TotalLiabilities"), 
                                times = 30),
                  Bank = rep(c("WellsFargo", "BankOfAmerica", "ICICI"), 
                             each = 30),
                  Country = rep(c("UnitedStates", "India"), times = c(60, 30)),
                  Year = rep(c(2000:2009), each = 3, times = 3),
                  Value = runif(90, min=300, max=600))

Let's aggregate mean of the "Pretax" values by "Country" and "Year", but only for the years 2001 to 2005.

aggregate(Value ~ Country + Year, 
          dat[dat$KeyItem == "Pretax" & dat$Year >= 2001 & dat$Year <=2005, ], 
          mean)
#         Country Year    Value
# 1         India 2001 399.7184
# 2  UnitedStates 2001 464.1638
# 3         India 2002 443.5636
# 4  UnitedStates 2002 560.8373
# 5         India 2003 562.5964
# 6  UnitedStates 2003 370.9591
# 7         India 2004 404.0050
# 8  UnitedStates 2004 520.4933
# 9         India 2005 567.6595
# 10 UnitedStates 2005 493.0583

Here's the same thing in data.table

library(data.table)
DT <- data.table(dat, key = "Country,Bank,Year")
subset(DT, KeyItem == "Pretax")[Year %between% c(2001, 2005), 
    mean(Value), by = list(Country, Year)]
#          Country Year       V1
#  1:        India 2001 399.7184
#  2:        India 2002 443.5636
#  3:        India 2003 562.5964
#  4:        India 2004 404.0050
#  5:        India 2005 567.6595
#  6: UnitedStates 2001 464.1638
#  7: UnitedStates 2002 560.8373
#  8: UnitedStates 2003 370.9591
#  9: UnitedStates 2004 520.4933
# 10: UnitedStates 2005 493.0583
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485