
I have a dataset with rainfall records since 2003. Another dataset contains the sampling dates from 2003 until now. I want to sum the amount of rain between consecutive sampling dates (see the object called date.per.year).

I found this, but I want to use a vector of break values: c1 = sum(rain in interval [X, Y[), c2 = sum(rain in interval [Y, Z[), c3 = sum(rain in interval [Z, A[), etc.

date.per.year = structure(c(12110, 12460, 12815, 13196, 13564.5, 13930, 14321, 
                            14652, 15028, 15408, 15792, 16106),
                          .Names = c("2003", "2004", "2005", "2006", "2007", "2008",
                                     "2009", "2010", "2011", "2012", "2013", "2014"))

Imagine that the Date and rain data frame is this:

dates = seq(as.Date("2003/1/1"), as.Date("2015/1/1"), "days")
df = data.frame(Dates = dates, 
                rain = rnorm(length(dates)))

I also tried this, but it's not creating bins that are usable:

## create corresponding intervals
splits <- cut(date.per.year, median, breaks=date.per.year)

## split df$rain into intervals and sum them
lapply(split(df$rain, f=splits), sum)

Warning message:
In split.default(df$rain, f = splits) :
  data length is not a multiple of split variable

Or even this:

library(data.table)
DT <- data.table(df)
setkey(DT, rain, Dates)

DT[, sumSum := DT[ .(.BY[[1]], .d+(-5:-1) )][, sum(sum, na.rm=TRUE)] , by=list(date.per.year, .d=Dates)]
Error in `[.data.table`(DT, , `:=`(sumSum, DT[.(.BY[[1]], .d + (-5:-1))][,  : The items in the 'by' or 'keyby' list are length (12,4384). Each must be same length as rows in x or number of rows returned by i (4384).

DT

An illustration of what I want to do is below. Imagine that the red lines are the dates that are creating the ranges I want to sum (which is the date.per.year object). In the end, I should have 11 values of the sum of the different ranges. Is it possible to do this?

[Image: illustration of the rain series with red lines marking the sampling dates]

M. Beausoleil

1 Answer


You need to supply an origin to convert those numerics to Date objects; R versions before 4.3.0 raise an error asking for one otherwise. Afterwards, cutting based on this variable is simple.

cuts <- as.Date(date.per.year, origin = as.Date("1970/1/1"))
binned <- cut(df$Dates, 
              breaks = cuts)

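N.B. Each interval includes its start date and excludes its end date, and the breaks only span 2003-02-27 to 2014-02-05. Any value of df$Dates before the first break, or on or after the last break, is therefore binned as NA.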
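To see how cut() treats dates at and outside the breaks, here is a minimal sketch on made-up dates. The names toy_breaks, toy_dates and toy_bins are hypothetical and not part of the question's data; the sketch relies on the default right = FALSE of cut() for Date vectors.

```r
# Hypothetical toy data, only to illustrate the binning rule
toy_breaks <- as.Date(c("2020-01-01", "2020-01-11", "2020-01-21"))
toy_dates  <- as.Date(c("2019-12-31", "2020-01-01", "2020-01-10",
                        "2020-01-11", "2020-01-20", "2020-01-21"))

# With Date input, cut() defaults to right = FALSE,
# so each bin is [start, end) and is labelled by its start date
toy_bins <- cut(toy_dates, breaks = toy_breaks)

# Dates before the first break, or on/after the last break, come out as NA
data.frame(date = toy_dates, bin = toy_bins)
```

The first and last rows land in no bin, which is the same effect that produces the NA level in the real data.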

You will notice that the levels of the resulting factor are labelled by the start date of each interval:

 unique(binned)
 [1] <NA>       2003-02-27 2004-02-12 2005-02-01 2006-02-17
 [6] 2007-02-20 2008-02-21 2009-03-18 2010-02-12 2011-02-23
[11] 2012-03-09 2013-03-28
11 Levels: 2003-02-27 2004-02-12 2005-02-01 ... 2013-03-28

As for the group-wise sum, there are thousands of Stack Overflow posts which may help you accomplish this. You could, for example, use dplyr:

  library(dplyr)

  df %>% mutate(binned = cut(Dates, breaks = cuts)) %>% 
    group_by(binned) %>% summarize(sum(rain))

# A tibble: 12 x 2
       binned  sum(rain)
       <fctr>      <dbl>
1  2003-02-27   7.996658
2  2004-02-12 -11.950646
3  2005-02-01  30.443479
4  2006-02-17  19.687989
5  2007-02-20  -2.088648
6  2008-02-21  33.837560
7  2009-03-18  -5.039810
8  2010-02-12  -5.235960
9  2011-02-23  -9.806273
10 2012-03-09  -3.887545
11 2013-03-28  30.446548
12         NA  36.634249

Remember that the NA in row 12 is the total rain both before 2003-02-27 and on or after the last break (2014-02-05). Drop that row to keep the 11 interval sums you asked for.
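If you would rather avoid the extra package, a base R sketch of the same idea could look like the following. It rebuilds the question's objects so it runs on its own; set.seed(1) and the name rain_sums are my additions for reproducibility and illustration, not something from the question.

```r
set.seed(1)  # assumption: added only so the random rain values are reproducible

# Sampling dates from the question, as days since 1970-01-01
date.per.year <- c(12110, 12460, 12815, 13196, 13564.5, 13930, 14321,
                   14652, 15028, 15408, 15792, 16106)

# Simulated daily rain, as in the question
days <- seq(as.Date("2003-01-01"), as.Date("2015-01-01"), by = "day")
df <- data.frame(Dates = days, rain = rnorm(length(days)))

# Convert the breaks to Dates and bin every day into its interval
cuts <- as.Date(date.per.year, origin = "1970-01-01")
binned <- cut(df$Dates, breaks = cuts)

# tapply() ignores rows whose group is NA, so only the 11 intervals remain
rain_sums <- tapply(df$rain, binned, sum)
rain_sums
```

Because tapply() skips the NA group, the result has exactly one sum per interval and no extra row to filter out.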

shayaa