Modification of "data.table" to give first 10% of each group

Question

I have this code:

library(data.table)
setDT(df)[, .SD[which.min(Julian_Day)], .(species, Year)]

Example of the df:

df=data.frame(
  year=c(1901,1901,1901,1901,1901,1901,1901,1901,1901,1901,1901,1901,1901),
  temp=c(29,25,21,26,20,20,26,25,24,23,23,24,26),
  habitat=c("fst","fld","city","city","fst","fld","fst","road","river","river","city","city","city"),
  species=c("blu","blu","pink","pink","pink","pink","pink","pink","pink","pink","pink","pink","pink"),
  day= c(34,87,93,79,56,98,100,187,54,14,63,57,23))

What I want the new subset to look like:

dfout <- data.frame(
       year=c(1901,1901,1901),
       temp=c(29,25,23),
       habitat=c("fst","fld","river"),
       species=c("blu","blu","pink"),
       day=c(34,87,14),
       first10= c(NA,NA,23)
)

The new subset should add a column holding the mean temp for the first 10% of observations (ordered by day) for EACH species in EACH year (my data cover the years 1901-2000 and 100 species). As can be seen above, the blu species has only 2 observations in 1901, so there is not enough data to compute a mean for the first 10% and NA is returned. Observations that were not used to calculate the first-10% mean are omitted from the new subset. If there were, say, 30 observations of the pink species in 1901, then 3 rows would be returned in the new subset, all with the same value in the first10 column.

I don't know why this has attracted downvotes. If y'all think it has been asked before, please point out the duplicate. — Frank, Apr 05 '15 at 04:01
@Frank my guess is it was downvoted because the example isn't reproducible? — David Arenburg, Apr 05 '15 at 09:15
Thanks for posting an example, but could you modify it so it is easy to copy-paste into the R console? — Frank, Apr 06 '15 at 02:29
sure, i'm just not sure what i need to change. sorry, i'm really new to this all. — John, Apr 06 '15 at 02:38
Fyi, you can check if you fixed it by copy-pasting it into the R console yourself. Your code still did not work, but I've fixed it so now it does (adding commas and quotes where needed). — Frank, Apr 06 '15 at 15:11

score 4 · Accepted Answer · edited May 23 '17 at 10:24

The special variable .N stores the number of observations in the subset for each (species,Year) group, so you can select .SD[(1:.N)/.N < .05].

Alternately, it is more efficient to avoid .SD, which can be done here using

setDT(df)
df[df[,.I[(1:.N)/.N < .05],.(species,Year)]$V1]

.I is another special variable, holding row numbers in df. I borrowed this way of using .I from @eddi's answer here. Both .N and .I can be read about in the documentation by typing ?data.table.
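
As a rough illustration of the .N / .I idiom, the sketch below runs both forms on the example data from the question. It assumes the lower-case `year` column of that example and a 10% cutoff (so that at least one row is selected); the names `via_sd`, `idx` and `via_i` are made up for this illustration and are not part of the original answer.

```r
library(data.table)

# Example data from the question (note: the column is `year`, lower case)
df <- data.table(
  year    = rep(1901, 13),
  temp    = c(29,25,21,26,20,20,26,25,24,23,23,24,26),
  habitat = c("fst","fld","city","city","fst","fld","fst",
              "road","river","river","city","city","city"),
  species = c("blu","blu", rep("pink", 11)),
  day     = c(34,87,93,79,56,98,100,187,54,14,63,57,23)
)

# .N is the group size; .I holds the row numbers of df that fall in the group
print(df[, .(n = .N, first_row = .I[1]), by = .(species, year)])

# Form 1: subset .SD inside each group (assumed 10% cutoff)
via_sd <- df[, .SD[seq_len(.N) / .N < 0.1], by = .(species, year)]

# Form 2: collect row numbers per group, then subset df once
idx   <- df[, .I[seq_len(.N) / .N < 0.1], by = .(species, year)]$V1
via_i <- df[idx]
```

Both forms pick rows in their existing order within each group; nothing here sorts by day.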

Update. In light of your more complicated request, I'm appending to my original answer:

df[,{
    r10s     <- 1:.N/.N < .1
    myrows   <- if(sum(r10s)>0){r10s}else{TRUE}
    c(
        .SD[myrows],
        list(first10=mean(day[r10s]))
    )
},.(species,year)]

This returns NaN for first10 when the mean cannot be computed, as is standard in R:

   species year temp habitat day first10
1:     blu 1901   29     fst  34     NaN
2:     blu 1901   25     fld  87     NaN
3:    pink 1901   21    city  93      93
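
Note that the snippet above averages day over rows in their existing order. If the intent is the mean temp over the earliest 10% of days, with NA rather than NaN for groups that are too small, one possible variant sorts first and averages temp. This is an assumption about what is wanted, not part of the original answer, and `out` is a name chosen here for illustration.

```r
library(data.table)

# Example data from the question
df <- data.table(
  year    = rep(1901, 13),
  temp    = c(29,25,21,26,20,20,26,25,24,23,23,24,26),
  habitat = c("fst","fld","city","city","fst","fld","fst",
              "road","river","river","city","city","city"),
  species = c("blu","blu", rep("pink", 11)),
  day     = c(34,87,93,79,56,98,100,187,54,14,63,57,23)
)

# Sort so that "first 10%" means the earliest days within each group
setorder(df, species, year, day)

out <- df[, {
  r10s   <- seq_len(.N) / .N < 0.1          # rows in the first 10% of the group
  myrows <- if (any(r10s)) r10s else TRUE   # keep all rows when none qualify
  c(.SD[myrows],
    # assumed: mean of temp, and NA (not NaN) when no row qualifies
    list(first10 = if (any(r10s)) mean(temp[r10s]) else NA_real_))
}, by = .(species, year)]

print(out)
```

On the example data this keeps both blu rows with first10 set to NA, and the single pink row with day 14, whose first10 is 23.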

answered Apr 05 '15 at 04:00

Frank

Fairly sure, yes; I mean, it makes sense to me that working with `.I` split up by the grouping variables should be faster than `.SD` in the first call, especially if there are many columns. And in the second call, subsetting by row numbers should be pretty cheap. `$` costs nothing, right? It just grabs an element from a list by string matching names in that list. – Frank Apr 05 '15 at 16:27
I just tried this code in R and it looks like it could work but when I try making it a new dataframe, and then I type in that dataframe name it returns back "NULL" and it doesn't give a new column with the mean for each species for each year... – John Apr 06 '15 at 00:19
@John Okay. If you can make what you see into a reproducible example, I bet we can find a solution. – Frank Apr 06 '15 at 01:11
okay, so I fixed up my question but this time i made it for the first 10% of observations instead of 5%. please feel free to ask any questions! – John Apr 06 '15 at 02:09
@John I've updated my answer to match your output. If this isn't exactly what you're looking for, I hope it will be a useful guide at least. If you have another substantially different question after this, it'd be best to post it as a separate question, I think. – Frank Apr 06 '15 at 15:29
