0

I am having trouble summing select columns within a data frame, a basic problem that I've seen numerous similar, but not identical questions/answers for on StackOverflow.

With this perhaps overly complex data frame:

site<-c(223,257,223,223,257,298,223,298,298,211)
moisture<-c(7,7,7,7,7,8,7,8,8,5)
shade<-c(83,18,83,83,18,76,83,76,76,51)
sampleID<-c(158,163,222,107,106,166,188,186,262,114)
bluestm<-c(3,4,6,3,0,0,1,1,1,0)
foxtail<-c(0,2,0,4,0,1,1,0,3,0)
crabgr<-c(0,0,2,0,33,0,2,1,2,0)
johnson<-c(0,0,0,7,0,8,1,0,1,0)
sedge1<-c(2,0,3,0,0,9,1,0,4,0)
sedge2<-c(0,0,1,0,1,0,0,1,1,1)
redoak<-c(9,1,0,5,0,4,0,0,5,0)
blkoak<-c(0,22,0,23,0,23,22,17,0,0)

my.data<-data.frame(site,moisture,shade,sampleID,bluestm,foxtail,crabgr,johnson,sedge1,sedge2,redoak,blkoak)

I want to sum the counts of each plant species (bluestem, foxtail, etc. - columns 4-12 in this example) within each site, by summing rows that have the same site number. I also want to keep information about moisture and shade (these are consistant withing site, but may also be the same between sites), and want a new column that is the count of number of rows summed.

the result would look like this

site,moisture,shade,NumSamples,bluestm,foxtail,crabgr,johnson,sedge1,sedge2,redoak,blkoak
211,5,51,1,0,0,0,0,0,1,0,0
223,7,83,4,13,5,4,8,6,1,14,45
257,7,18,2,4,2,33,0,0,1,1,22
298,8,76,3,2,4,3,9,13,2,9,40

The problem I am having is that, my real data sets (and I have several of them) have from 50 to 300 plant species, and I want refer a range of columns (in this case, [5:12] ) instead of my.data$foxtail, my.data$sedge1, etc., which is going to be very difficult with 300 species.

I know I can start off by deleting the column I don't need (SampleID)

my.data$SampleID <- NULL

but then how do I get the sums? I've messed with the aggregate command and with ddply, and have seen lots of examples which call particular column names, but just haven't gotten anything to work. I recognize this is a variant of a commonly asked and simple type of question, but I've spent hours without resolving it on my own. So, apologies for my stupidity!

user1981275
  • 13,002
  • 8
  • 72
  • 101
Steve T
  • 73
  • 1
  • 11

2 Answers2

1

This works ok:

x <- aggregate(my.data[,5:12], by=list(site=my.data$site, moisture=my.data$moisture, shade=my.data$shade), FUN=sum, na.rm=T)



library(dplyr)

my.data %>% 
  group_by(site) %>%
  tally %>%
  left_join(x)

  site n moisture shade bluestm foxtail crabgr johnson sedge1 sedge2 redoak blkoak
1  211 1        5    51       0       0      0       0      0      1      0      0
2  223 4        7    83      13       5      4       8      6      1     14     45
3  257 2        7    18       4       2     33       0      0      1      1     22
4  298 3        8    76       2       4      3       9     13      2      9     40

Or to do it all in dplyr

my.data %>% 
  group_by(site) %>%
  tally %>%
  left_join(my.data) %>%
  group_by(site,moisture,shade,n) %>%
  summarise_each(funs(sum=sum)) %>%
  select(-sampleID)


    site moisture shade n bluestm foxtail crabgr johnson sedge1 sedge2 redoak blkoak
  1  211        5    51 1       0       0      0       0      0      1      0      0
  2  223        7    83 4      13       5      4       8      6      1     14     45
  3  257        7    18 2       4       2     33       0      0      1      1     22
  4  298        8    76 3       2       4      3       9     13      2      9     40
jalapic
  • 13,792
  • 8
  • 57
  • 87
  • Thanks your upper example seems to do it and I think I'll be able to understand it. The lower (all in dplyr) doesn't calculate correctly. Thanks for the working solution! – Steve T Aug 31 '14 at 13:41
  • Thanks, that fixed it. As I get into actually implementing this, though, I recieved this error, ugh: Warning in install.packages : package ‘dplyr’ is not available (for R version 2.15.2) – Steve T Aug 31 '14 at 15:00
  • 1
    @SteveT From the dplyr userguide: Depends R (>= 3.0.2) - so you'll need to update your R to the latest version. Just download from CRAN. I highly recommend it ! :) – jalapic Aug 31 '14 at 15:07
  • duh!! I couldn't see the forest for the trees. Thank you! – Steve T Aug 31 '14 at 15:44
1

Try following using base R:

outdf<-data.frame(site=numeric(),moisture=numeric(),shade=numeric(),bluestm=numeric(),foxtail=numeric(),crabgr=numeric(),johnson=numeric(),sedge1=numeric(),sedge2=numeric(),redoak=numeric(),blkoak=numeric()) 

my.data$basic = with(my.data, paste(site, moisture, shade))

for(b in unique(my.data$basic)) {
        outdf[nrow(outdf)+1,1:3] = unlist(strsplit(b,' '))
        for(i in 4:11)
             outdf[nrow(outdf),i]= sum(my.data[my.data$basic==b,i])
}

outdf
  site moisture shade bluestm foxtail crabgr johnson sedge1 sedge2 redoak blkoak
1  223        7    83      13       5      4       8      6      1     14     45
2  257        7    18       4       2     33       0      0      1      1     22
3  298        8    76       2       4      3       9     13      2      9     40
4  211        5    51       0       0      0       0      0      1      0      0
rnso
  • 23,686
  • 25
  • 112
  • 234