Data reshape and aggregate for large database

Question

I have a large dataset that I would like to both reshape and aggregate, and am running into some difficulties.

For example, consider the following, where quotes separate columns in database. Factors are year, habitat, site, and replicate. The fields species and biomass are the data:

"year" "habitat" "site" "replicate" "species" "biomass"
 2010    inner       a          1       sp.1        10
 2010    inner       a          1       sp.3         6
 2010    inner       a          1       sp.1         5
 2010    inner       a          1       sp.2         8
 2010    inner       a          1       sp.4         4
 2010    inner       a          1       sp.5         7
 2010    inner       a          2       sp.3         5
 2010    inner       a          2       sp.2         6
 2010    inner       a          2       sp.5         2
 2010    inner       a          2       sp.1         5
 2010    inner       a          3       sp.4         5
 2010    inner       a          3       sp.3         4
 2010    inner       a          3       sp.6         8
 2010    inner       a          3       sp.2         5
 2010    outer       b          1       sp.1         6
 2010    outer       b          1       sp.3         9
 2010    outer       b          1       sp.3         3
 2010    outer       b          1       sp.2         6
 2010    outer       b          2       sp.5         4
 2010    outer       b          2       sp.1         5
 2010    outer       b          2       sp.1         7
 2010    outer       b          2       sp.2         5
 2010    outer       b          3       sp.4         2
 2010    outer       b          3       sp.6         5
 2010    outer       b          3       sp.2         4
 2010    outer       b          3       sp.1         4

What I would like to do is reshape and aggregate my table while keeping all the factors. So, a species by site matrix, with the ability to keep factors and aggregate at the level of replicate. I've included an example picture below of what I'd like, where the data were filled with a 'sum' function using Excel pivot table. Is there a simple code for this in R using dplyr and groups? tapply and aggregate?

year    habitat site    replicate   sp.1    sp.2    sp.3    sp.4    sp.5    sp.6
2010    inner    a          1        15      8       6       4       7       0
2010    inner    a          3        0       5       4       5       0       8
2010    outer    b          1        6       6       12      0       0       0
2010    outer    b          2        12      5       0       0       4       0
2010    outer    b          3        4       4       0       2       0       5

To help answerers, I'd recommend using `dput` to make it so we can copy/paste the R object data example directly into our workspace. See the first answer here, under heading "Copy your data": http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — Will Beason, Aug 17 '14 at 00:54

mgriebe · Answer 1 · 2014-08-17T18:42:08.893

First, if you have a large data set I'd suggest using data table.

I use the function "BreakBuild" routinely, but I cannot test it on your code.

library(data.table)
mydata<-data.table(read.table("mydata"))

aggmydata<- mydata[,list(biomass=sum(biomass)),by=list(year,habitat,site,replicate,species)]

BreakBuild <- function (df,BCol,IDCols,ValCols) {
  setkeyv(df,BCol)
  NewCols <- unique(df[[BCol]]); #Scan the column for unique entries.  Each entry gets a data table.
  ldt <- list(); #List that will hold a data table for each break.
  ColList <-c(IDCols,ValCols) # List of columns, does not include broken collumn.
  for (bframe in NewCols) {
    ldt[[bframe]] = df[bframe,ColList, with=FALSE] #Create and store a data table with columns from IDCols and VolCols.
    if(length(ValCols)>1){setnames(ldt[[bframe]], ValCols, paste(bframe,ValCols,sep="."))} #Prefix the Value columns with the name of the break.
    else {setnames(ldt[[bframe]], ValCols, bframe)}  #If there is only one Value Column, give it the name of the break.
  }
  Build<-Reduce(function(...) merge(...,by=IDCols,all=T),ldt)
  return(Build)
}

waggmydata<- BreakBuild(aggmydata,BCol="species",
                        IDCols=c("year","habitat","site","replicate"),
                        ValCols="biomass")

You can also use reshape2, and dcast, though I am not all that familiar with the syntax of dcast. It would be something like

library(reshape2)
waggmydata<-dcast.data.table(mydata,  year+habitat+site+replicate~ species, fun=sum)

If your data set is very large, (millions and millions of rows) it may be worthwhile to use data.table, though it seems dplyr has made major gains since the last time I tried it out.

Here are the results of the benchmark results for the three methods tested.

require(data.table)
require(reshape2)
require(dplyr)
require(tidyr)
require(microbenchmark)
BreakBuild <- function (df,BCol,IDCols,ValCols) {
  setkeyv(df,BCol)
  NewCols <- unique(df[[BCol]]); #Scan the column for unique entries.  Each entry gets a data table.
  ldt <- list(); #List that will hold a data table for each break.
  ColList <-c(IDCols,ValCols) # List of columns, does not include broken collumn.
  for (bframe in NewCols) {
    ldt[[bframe]] = df[bframe,ColList, with=FALSE] #Create and store a data table with columns from IDCols and VolCols.
    if(length(ValCols)>1){setnames(ldt[[bframe]], ValCols, paste(bframe,ValCols,sep="."))} #Prefix the Value columns with the name of the break.
    else {setnames(ldt[[bframe]], ValCols, bframe)}  #If there is only one Value Column, give it the name of the break.
  }
  Build<-Reduce(function(...) merge(...,by=IDCols,all=T),ldt)
  return(Build)
}

mydata<-data.table(year=sample(c("2010","2011","2012"),1e6,replace=T),
                   habitat=sample(c("inner","outer"),1e6,replace=T),
                   site=sample(letters[1:15],1e6,replace=T),
                   replicate=sample(1:5,1e6,replace=T),
                   species=sample(paste("sp",1:20,sep="."),1e6,replace=T),
                   biomass=sample(1:30,1e6,replace=T))
dat1<- as.data.frame(mydata)


microbenchmark(
  DPLYR= {
    dat1%>% 
      group_by(year, habitat,site,replicate, species)%>%
      summarise(biomass=sum(biomass)) %>%
      spread(species, biomass,fill=0)
  },
  DATATABLE = {
    aggmydata<- mydata[,list(biomass=sum(biomass)),by=list(year,habitat,site,replicate,species)]
    waggmydata<- BreakBuild(aggmydata,BCol="species",
                            IDCols=c("year","habitat","site","replicate"),
                            ValCols="biomass")
  },
  DCAST.DATA.TABLE = {
    waggmydata<-dcast.data.table(mydata,  year+habitat+site+replicate~ species, fun=sum)
  }
  )

##             expr       min       lq   median       uq      max neval
##            DPLYR 168.26559 170.3902 171.9306 173.9712 189.5266   100
##        DATATABLE  97.21738 101.1543 103.5157 108.2527 125.9114   100
## DCAST.DATA.TABLE 184.58250 189.4021 192.0251 195.3731 242.9994   100

Once you know the syntax, I'd say DCAST.DATA.TABLE is the easiest to code for this example.

There's a great explanation of `cast` syntax here: http://seananderson.ca/2013/10/19/reshape.html. In particular, see "Figure 1" towards the bottom of the page. — shadowtalker, Aug 17 '14 at 16:27
Nice, +1. With 1.9.3, there're some speed improvements to `dcast.data.table`. It gives me, `455, 351, 180ms`, in the same order. But it'd be even nicer to benchmark on bigger data :). — Arun, Aug 17 '14 at 22:35

score 1 · Accepted Answer · answered Aug 17 '14 at 07:22

Using dplyr (if dat1 is the dataset)

  library(dplyr)
  library(tidyr)

  dat1%>% 
  group_by(year, habitat,site,replicate, species)%>%
  summarise(biomass=sum(biomass)) %>%
  spread(species, biomass,fill=0)
  # Source: local data frame [6 x 10]

  #  year habitat site replicate sp.1 sp.2 sp.3 sp.4 sp.5 sp.6
  #1 2010   inner    a         1   15    8    6    4    7    0
  #2 2010   inner    a         2    5    6    5    0    2    0
  #3 2010   inner    a         3    0    5    4    5    0    8
  #4 2010   outer    b         1    6    6   12    0    0    0
  #5 2010   outer    b         2   12    5    0    0    4    0
  #6 2010   outer    b         3    4    4    0    2    0    5

BTW: replicate 2 of habitat inner was not shown in the expected output.

Thanks, exactly what I was looking to do. Nice combo of packages! — Peter Houk, Aug 17 '14 at 13:00

Data reshape and aggregate for large database

2 Answers2