I have a data frame that describes the sequential movements of animals (ID
column) and the time spent there (start
and end
columns). These movements are recorded over small scales but are classified within larger regions (classification
column), such that an animal can move multiple times within a region before later moving to another region and moving around. They can also stay in one region for the whole time, or never move at all.
The sequence of movements within each region is tracked in the sequent_moves
column (see this question for a more thorough explanation of how these are created). Animals can potentially move back to a region they earlier left. There is also a column of chemical data, Mean_8786Sr
which is related to that region.
I want to collapse this data frame so that I end up with a description of only the regional movements. So, subsetting by Sample
and sequent_moves
I want to keep the minimum start
value and the maximum end
value, ending up with the start and end time within the region. I further want a mean of the chemical data in Mean_8786Sr
. The rest of the columns I want to either keep the minimum value or the factor value as shown in the example code below.
I can do this using by()
, but so far it requires a statement for each column. My actual data has quite a few more columns and many thousand rows. I'm pretty sure there is a faster, more elegant way to do this, perhaps with data.table (since I'm liking what I've seen from that package so far).
Below is my result. Is there a more efficient way to do this?
movement = data.frame(structure(list(start = c(0, 0, 110, 126, 235, 0, 17, 139, 251,
0, 35, 47, 99, 219, 232, 269, 386, 398, 414, 443, 459), end = c(782L,
110L, 126L, 235L, 612L, 17L, 139L, 251L, 493L, 35L, 47L, 99L,
219L, 232L, 269L, 386L, 398L, 414L, 443L, 459L, 765L), Mean_8786Sr = c(0.709269349163555,
0.710120935400909, 0.70934948311875, 0.71042744033211, 0.709296068424668,
0.708621911917647, 0.709358583256557, 0.710189508916071, 0.709257758963636,
0.711148891471429, 0.712470115258333, 0.713742475130769, 0.714572498375,
0.713400790353846, 0.711656338391892, 0.710380629097436, 0.711571667241667,
0.71290867871875, 0.712009033513793, 0.71104293234375, 0.709344687326471
), Sample = c("2006_3174", "2006_3185", "2006_3185", "2006_3185",
"2006_3185", "2006_3189", "2006_3189", "2006_3189", "2006_3189",
"2006_3194", "2006_3194", "2006_3194", "2006_3194", "2006_3194",
"2006_3194", "2006_3194", "2006_3194", "2006_3194", "2006_3194",
"2006_3194", "2006_3194"), ID = c("1", "1", "2", "3", "4", "1",
"2", "3", "4", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10",
"11", "12"), return_year = c(2006L, 2006L, 2006L, 2006L, 2006L,
2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L,
2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L), classification = c("CW",
"CW", "SK", "CW", "CW", "SK", "SK", "CW", "CW", "CW", "CW", "CW",
"CW", "CW", "CW", "CW", "CW", "CW", "CW", "CW", "CW"), sequent_moves = c(1L,
1L, 2L, 3L, 3L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L), Sample_cptID = c("2006_3174 1", "2006_3185 1",
"2006_3185 2", "2006_3185 3", "2006_3185 3", "2006_3189 1", "2006_3189 1",
"2006_3189 2", "2006_3189 2", "2006_3194 1", "2006_3194 1", "2006_3194 1",
"2006_3194 1", "2006_3194 1", "2006_3194 1", "2006_3194 1", "2006_3194 1",
"2006_3194 1", "2006_3194 1", "2006_3194 1", "2006_3194 1")), .Names = c("start",
"end", "Mean_8786Sr", "Sample", "ID", "return_year", "classification",
"sequent_moves", "Sample_cptID"), class = "data.frame", row.names = 6:26))
Here is my solution using by():
moves = by(movement_dput, INDICES = c(factor(movement_dput$Sample_cptID)), function (x) {
start = min(x[,"start"])
end = max(x[,"end"])
Mean_8786Sr = mean(x[,"Mean_8786Sr"])
Sample = x[1,"Sample"]
ID = min(x[,"ID"])
return_year = x[1,"return_year"]
classification = x[1,"classification"]
sequent_moves = x[1,"sequent_moves"]
move = cbind(start, end, Mean_8786Sr, Sample, ID, return_year, classification, sequent_moves)
move
}
)
regional_moves = do.call(rbind.data.frame, moves)
regional_moves
Is there,
- a more efficient way to do this?
- an easier or more compact way to specify which columns I want max(), min(), etc...?
Edit: Adding partial data.table solution per Jeannie's comment.
Here is what I have so far using data.table.
require('data.table')
m=setDT(movement)
m[, .(start=base::min(start),
end=base::max(end),
Mean_8786Sr=mean(Mean_8786Sr),
ID = base::min(ID),
return_year = return_year[1],
classification = classification[1],
Sample_cptID = Sample_cptID[1])
, by=c('Sample', 'sequent_moves')]
If I run this without base::min()
I get errors. The current error is:
Error in `g[`(Sample_cptID, 1) : object 'Sample_cptID' not found
in a prior iteration (that didn't work) I got:
Error in gmin(ID) :
GForce min can only be applied to columns, not .SD or similar. To find min of all items in a list such as .SD, either add the prefix base::min(.SD) or turn off GForce optimization using options(datatable.optimize=1). More likely, you may be looking for 'DT[,lapply(.SD,min),by=,.SDcols=]'
Running it with base min()
and max()
functions it works. I'm trying to understand what GForce is really doing in optimizing the speed, I assume that that has something to do with why it isn't returning the functionality I expected. This thread talks about it, but I haven't digested it completely. Any ideas?
It would be nice to be able to pass min, max and mean to a list that I can populate with colnames. The vast majority of columns I just want the first element. It would be more compact if there was a way to specify the max, min and mean columns directly and then say the equivalent of " for every other column, give me the first element".