Collapsing a data frame by factors with multiple criteria

Question

I have a data frame that describes the sequential movements of animals (ID column) and the time spent there (start and end columns). These movements are recorded over small scales but are classified within larger regions (classification column), such that an animal can move multiple times within a region before later moving to another region and moving around. They can also stay in one region for the whole time, or never move at all.

The sequence of movements within each region is tracked in the sequent_moves column (see this question for a more thorough explanation of how these are created). Animals can potentially move back to a region they earlier left. There is also a column of chemical data, Mean_8786Sr which is related to that region.

I want to collapse this data frame so that I end up with a description of only the regional movements. So, subsetting by Sample and sequent_moves I want to keep the minimum start value and the maximum end value, ending up with the start and end time within the region. I further want a mean of the chemical data in Mean_8786Sr. The rest of the columns I want to either keep the minimum value or the factor value as shown in the example code below.

I can do this using by(), but so far it requires a statement for each column. My actual data has quite a few more columns and many thousand rows. I'm pretty sure there is a faster, more elegant way to do this, perhaps with data.table (since I'm liking what I've seen from that package so far).

Below is my result. Is there a more efficient way to do this?

    movement = data.frame(structure(list(start = c(0, 0, 110, 126, 235, 0, 17, 139, 251, 
0, 35, 47, 99, 219, 232, 269, 386, 398, 414, 443, 459), end = c(782L, 
110L, 126L, 235L, 612L, 17L, 139L, 251L, 493L, 35L, 47L, 99L, 
219L, 232L, 269L, 386L, 398L, 414L, 443L, 459L, 765L), Mean_8786Sr = c(0.709269349163555, 
0.710120935400909, 0.70934948311875, 0.71042744033211, 0.709296068424668, 
0.708621911917647, 0.709358583256557, 0.710189508916071, 0.709257758963636, 
0.711148891471429, 0.712470115258333, 0.713742475130769, 0.714572498375, 
0.713400790353846, 0.711656338391892, 0.710380629097436, 0.711571667241667, 
0.71290867871875, 0.712009033513793, 0.71104293234375, 0.709344687326471
), Sample = c("2006_3174", "2006_3185", "2006_3185", "2006_3185", 
"2006_3185", "2006_3189", "2006_3189", "2006_3189", "2006_3189", 
"2006_3194", "2006_3194", "2006_3194", "2006_3194", "2006_3194", 
"2006_3194", "2006_3194", "2006_3194", "2006_3194", "2006_3194", 
"2006_3194", "2006_3194"), ID = c("1", "1", "2", "3", "4", "1", 
"2", "3", "4", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", 
"11", "12"), return_year = c(2006L, 2006L, 2006L, 2006L, 2006L, 
2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 
2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L), classification = c("CW", 
"CW", "SK", "CW", "CW", "SK", "SK", "CW", "CW", "CW", "CW", "CW", 
"CW", "CW", "CW", "CW", "CW", "CW", "CW", "CW", "CW"), sequent_moves = c(1L, 
1L, 2L, 3L, 3L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L), Sample_cptID = c("2006_3174 1", "2006_3185 1", 
"2006_3185 2", "2006_3185 3", "2006_3185 3", "2006_3189 1", "2006_3189 1", 
"2006_3189 2", "2006_3189 2", "2006_3194 1", "2006_3194 1", "2006_3194 1", 
"2006_3194 1", "2006_3194 1", "2006_3194 1", "2006_3194 1", "2006_3194 1", 
"2006_3194 1", "2006_3194 1", "2006_3194 1", "2006_3194 1")), .Names = c("start", 
"end", "Mean_8786Sr", "Sample", "ID", "return_year", "classification", 
"sequent_moves", "Sample_cptID"), class = "data.frame", row.names = 6:26))

Here is my solution using by():

moves = by(movement_dput, INDICES = c(factor(movement_dput$Sample_cptID)), function (x) {

  start = min(x[,"start"])
  end = max(x[,"end"])
  Mean_8786Sr = mean(x[,"Mean_8786Sr"])
  Sample = x[1,"Sample"]
  ID = min(x[,"ID"])
  return_year = x[1,"return_year"]
  classification = x[1,"classification"]
  sequent_moves = x[1,"sequent_moves"]
  move = cbind(start, end, Mean_8786Sr, Sample, ID, return_year, classification, sequent_moves)
  move

  }
)
regional_moves = do.call(rbind.data.frame, moves)
regional_moves

Is there,

a more efficient way to do this?
an easier or more compact way to specify which columns I want max(), min(), etc...?

Edit: Adding partial data.table solution per Jeannie's comment.

Here is what I have so far using data.table.

require('data.table')
m=setDT(movement) 
m[, .(start=base::min(start), 
      end=base::max(end), 
      Mean_8786Sr=mean(Mean_8786Sr),
      ID = base::min(ID),
      return_year = return_year[1],
      classification = classification[1],
      Sample_cptID = Sample_cptID[1])
  , by=c('Sample', 'sequent_moves')]

If I run this without base::min() I get errors. The current error is:

Error in `g[`(Sample_cptID, 1) : object 'Sample_cptID' not found

in a prior iteration (that didn't work) I got:

    Error in gmin(ID) : 
      GForce min can only be applied to columns, not .SD or similar. To find min of all items in a list such as .SD, either add the prefix base::min(.SD) or turn off GForce optimization using options(datatable.optimize=1). More likely, you may be looking for 'DT[,lapply(.SD,min),by=,.SDcols=]'

Running it with base min() and max() functions it works. I'm trying to understand what GForce is really doing in optimizing the speed, I assume that that has something to do with why it isn't returning the functionality I expected. This thread talks about it, but I haven't digested it completely. Any ideas?

It would be nice to be able to pass min, max and mean to a list that I can populate with colnames. The vast majority of columns I just want the first element. It would be more compact if there was a way to specify the max, min and mean columns directly and then say the equivalent of " for every other column, give me the first element".

I don't have time to delve deeper in the specifics of this task, but for this kind of operations I would use: `require('data.table'); m=data.table(movement); m[, .(start=min(start), end=max(end), Mean_8786Sr=mean(Mean_8786Sr), classification = classification[1]), by=c('ID', 'sequent_moves')]`. This is not a complete answer but should give you ideas on how to use `data.table` for your problem. — Jealie, Oct 18 '17 at 18:55

Uwe · Accepted Answer · 2017-10-26T05:16:57.170

The OP has asked if there is a more efficient way to aggregate the movement data frame than by specifying each column individually.

I'm afraid that it is unavoidable to specify which columns need to be aggregated by which aggregation function. However, data.table syntax is quite compact in general. So, the call to by() can be implemented with data.table as follows:

library(data.table)
setDT(movement)[
  , .(start = min(start), end = max(end), Mean_8786Sr = mean(Mean_8786Sr), ID = min(ID)), 
  by = .(Sample, return_year, classification, sequent_moves)]

      Sample return_year classification sequent_moves start end Mean_8786Sr ID
1: 2006_3174        2006             CW             1     0 782   0.7092693  1
2: 2006_3185        2006             CW             1     0 110   0.7101209  1
3: 2006_3185        2006             SK             2   110 126   0.7093495  2
4: 2006_3185        2006             CW             3   126 612   0.7098618  3
5: 2006_3189        2006             SK             1     0 139   0.7089902  1
6: 2006_3189        2006             CW             2   139 493   0.7097236  3
7: 2006_3194        2006             CW             1     0 765   0.7120207  1

Note that all variables which are invariant or constant within each group are treated as grouping variables in by = .... This saves some typing but puts the columns in front of the other (aggregated) columns.

Collapsing a data frame by factors with multiple criteria

1 Answers1