subset dataframe

Question

I have a dataframe with counts of geese at several different sites. The aim was to count geese once a month, in all 8 months from September to April, at each site in consecutive winter periods. A winter period is defined as the 8 months from September to April.

If the method had been carried out as planned, this is what the data would look like:

library(lubridate)  # provides dmy()
df <- data.frame(site=c(rep('site 1', 16), rep('site 2', 16), rep('site 3', 16)),
                   date=dmy(rep(c('01/09/2007', '02/10/2007', '02/11/2007', 
                              '02/12/2007', '02/01/2008',  '02/02/2008', '02/03/2008',  
                              '02/04/2008', '01/09/2008', '02/10/2008', '02/11/2008', 
                                  '02/12/2008', '02/01/2009',  '02/02/2009', '02/03/2009',  
                                  '02/04/2009'),3)),
                   count=sample(1:100, 48))

In practice, some sites have all 8 counts in some September-April periods but not in others. In addition, some sites never achieved 8 counts in any September-April period. These toy data look like my actual data:

df <- df[-c(11:16, 36:48),]

I need to remove rows from the dataframe which do not form part of 8 consecutive counts in a September-April period. Using the toy data, this is the dataframe I need:

df <- df[-c(9:10, 27:29), ]

I've tried various commands using ddply() from the plyr package, but without success. Is there a solution to this problem?

what's < 8 monthly counts? 8 observations or count < 8? your output seems to be satisfying neither... — Arun, Mar 18 '13 at 09:29
Your question is not clear enough, subset are easy in R so please reformulate and there is high probability we answer you quickly — statquant, Mar 18 '13 at 09:33
-1 for lack of clarity in subject as well as question also shows no research effort — CHP, Mar 18 '13 at 09:34
Hi all, thanks for feedback. Question re-designed from bottom up. Let me know if still unclear. This has been a tricky problem to translate into words. — luciano, Mar 18 '13 at 11:08
Have various ways of using `nrow()` with `ddply()`, like: `library(lubridate); library(plyr); ddply(df, .(site, month(date)), nrow)`. However, as you can see, this hasn't really got me close to desired output. — luciano, Mar 18 '13 at 11:33

score 3 · Accepted Answer · edited May 23 '17 at 10:32

One way I can think of is to subtract four months from each date, so that every September-April winter falls within a single calendar year, and then group by that year. To subtract months reliably, I suggest you use the mondate package. See here for an excellent answer on the problems you can face when subtracting months and how to overcome them.
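As an illustration of why month arithmetic on dates needs care, here is a minimal base R sketch. It is only an example of one such pitfall and is not necessarily the case the linked answer discusses. The date used is an assumption chosen to trigger the problem; the toy data in the question, counted on the 1st or 2nd of each month, would not hit it.

```r
# Stepping back one calendar month from the 31st lands on a day that
# does not exist (31 February), so base R rolls it forward into March.
d <- as.Date("2008-03-31")
back_one <- seq(d, by = "-1 month", length.out = 2)[2]
format(back_one)  # "2008-03-02", still in March rather than February
```

A result like this would put the count in the wrong month, which is the kind of issue a dedicated month-arithmetic package is meant to avoid.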

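library(plyr)  # provides ddply(); lubridate is already loaded in the question
require(mondate)
# Shift each date back four months so a Sep-Apr winter falls in one calendar year
df$grp <- mondate(df$date) - 4
df$year <- year(df$grp)
df$month <- month(df$date)
# Keep a site/year group only if it has counts in all eight Sep-Apr months
ddply(df, .(site, year), function(x) {
    if (all(c(1:4, 9:12) %in% x$month)) {
        return(x)
    } else {
        return(NULL)
    }
})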

#      site       date count        grp year month
# 1  site 1 2007-09-01    87 2007-05-02 2007     9
# 2  site 1 2007-10-02    44 2007-06-02 2007    10
# 3  site 1 2007-11-02    50 2007-07-03 2007    11
# 4  site 1 2007-12-02    65 2007-08-02 2007    12
# 5  site 1 2008-01-02    12 2007-09-02 2007     1
# 6  site 1 2008-02-02     2 2007-10-03 2007     2
# 7  site 1 2008-03-02   100 2007-11-02 2007     3
# 8  site 1 2008-04-02    29 2007-12-03 2007     4
# 9  site 2 2007-09-01     3 2007-05-02 2007     9
# 10 site 2 2007-10-02    22 2007-06-02 2007    10
# 11 site 2 2007-11-02    56 2007-07-03 2007    11
# 12 site 2 2007-12-02     5 2007-08-02 2007    12
# 13 site 2 2008-01-02    40 2007-09-02 2007     1
# 14 site 2 2008-02-02    15 2007-10-03 2007     2
# 15 site 2 2008-03-02    10 2007-11-02 2007     3
# 16 site 2 2008-04-02    20 2007-12-03 2007     4
# 17 site 2 2008-09-01    93 2008-05-02 2008     9
# 18 site 2 2008-10-02    13 2008-06-02 2008    10
# 19 site 2 2008-11-02    58 2008-07-03 2008    11
# 20 site 2 2008-12-02    64 2008-08-02 2008    12
# 21 site 2 2009-01-02    92 2008-09-02 2008     1
# 22 site 2 2009-02-02    69 2008-10-03 2008     2
# 23 site 2 2009-03-02    89 2008-11-02 2008     3
# 24 site 2 2009-04-02    27 2008-12-03 2008     4
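The same grouping idea can also be sketched in base R without any date arithmetic, by labelling each row with the calendar year in which its winter starts. This is an alternative sketch, not the method used above: the names `toy`, `winter`, `in_season`, `n_months` and `complete` are made up for illustration, and the seed is arbitrary.

```r
# Rebuild the question's toy data with base R only.
dates <- c("2007-09-01", "2007-10-02", "2007-11-02", "2007-12-02",
           "2008-01-02", "2008-02-02", "2008-03-02", "2008-04-02",
           "2008-09-01", "2008-10-02", "2008-11-02", "2008-12-02",
           "2009-01-02", "2009-02-02", "2009-03-02", "2009-04-02")
set.seed(1)  # arbitrary seed, only so the counts are reproducible
toy <- data.frame(site  = rep(c("site 1", "site 2", "site 3"), each = 16),
                  date  = as.Date(rep(dates, 3)),
                  count = sample(1:100, 48),
                  stringsAsFactors = FALSE)
toy <- toy[-c(11:16, 36:48), ]  # same rows dropped as in the question

yr  <- as.integer(format(toy$date, "%Y"))
mon <- as.integer(format(toy$date, "%m"))

# Winter label: Sep-Dec belong to the winter starting that year,
# Jan-Apr to the winter that started the previous year.
toy$winter <- ifelse(mon >= 9, yr, yr - 1L)

# Number of distinct Sep-Apr months observed per site and winter.
in_season <- mon %in% c(1:4, 9:12)
n_months <- ave(ifelse(in_season, mon, NA), toy$site, toy$winter,
                FUN = function(m) length(unique(m[!is.na(m)])))

# Keep only rows that belong to a winter with all eight months counted.
complete <- toy[in_season & n_months == 8, ]
nrow(complete)  # 24
```

On the toy data this keeps the first winter of site 1 and both winters of site 2, the same 24 rows as the output above.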

An alternative solution using data.table:

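require(data.table)
require(mondate)
dt <- data.table(df)
# Winter year (date shifted back four months) and calendar month of each count
dt[, `:=`(year=year(mondate(date)-4), month=month(date))]
# Keep every row of a site/year group only if the group has all eight
# Sep-Apr months, then drop the helper columns
dt.out <- dt[, .SD[rep(all(c(1:4,9:12) %in% month), .N)], 
           by=list(site,year)][, c("year", "month") := NULL]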
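Whichever solution is used, the result can be sanity-checked with a small base R helper. This is a sketch under assumptions: `check_complete_winters()` is a hypothetical function written for this example, and it expects a data frame with `site` and `date` columns as in the question.

```r
# Hypothetical helper: TRUE if every site/winter group in `d` contains
# exactly the eight months September to April.
check_complete_winters <- function(d) {
  yr  <- as.integer(format(d$date, "%Y"))
  mon <- as.integer(format(d$date, "%m"))
  winter <- ifelse(mon >= 9, yr, yr - 1L)  # year in which the winter starts
  groups <- split(mon, list(d$site, winter), drop = TRUE)
  all(vapply(groups, function(m) setequal(m, c(1:4, 9:12)), logical(1)))
}

# A complete winter (Sep 2007 to Apr 2008) and the same winter with one month missing.
good <- data.frame(site = "site 1",
                   date = seq(as.Date("2007-09-01"), by = "month", length.out = 8))
bad <- good[-3, ]

check_complete_winters(good)  # TRUE
check_complete_winters(bad)   # FALSE
```

For the data.table result, a call such as `check_complete_winters(as.data.frame(dt.out))` should return TRUE if the filtering worked as intended.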
