0

I want to connect two datasets with each other by adding a new column called Average. This column is the average of the durations between Date and Date - diff. I got two datasets, the first one is called data and looks like this:

       Date   Weight   diff   Loc.nr  
2013-01-24     1040       7        2
2013-01-31     1000       7        2
2013-01-19      500       4        9
2013-01-23     1040       4        9
2013-01-28      415       5        9
2013-01-31      650       3        9

The other one is called Rain.duration, in the column Duration are the hours of rain on that day. This dataset looks like this:

      Date  Duration
2013-01-14       4.5
2013-01-15       0.0
2013-01-16       6.9
2013-01-17       0.0
2013-01-18       1.8
2013-01-19       2.1
2013-01-20       0.0
2013-01-21       0.0
2013-01-22       4.3
2013-01-23       0.0
2013-01-24       7.5
2013-01-25       4.7
2013-01-26       0.0
2013-01-27       0.7
2013-01-28       5.0
2013-01-29       0.0
2013-01-30       3.1
2013-01-31       2.8

I made a code to do this:

for(i in 1:nrow(data)) {
  for(j in 1:nrow(Rain.duration)) {
    if(data$Date[i] == Rain.duration$Date[j]) {
      average <- as.array(Rain.duration$Duration[(j-(data$diff[i])):j])

      j <- nrow(Rain.duration)
    }
  }
  data$Average[i] <- mean(average)
}

The problem of this code is that, because of the size of my datasets, it takes like 3 days to run. Is there a faster way to do this?

My expected outcome is:

       Date   Weight   diff   Loc.nr   Average
2013-01-24     1040       7        2      1.96
2013-01-31     1000       7        2      2.98
2013-01-19      500       4        9      2.16
2013-01-23     1040       4        9      1.28
2013-01-28      415       5        9      2.98
2013-01-31      650       3        9      2.73
marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
  • Perhaps I'm misreading your code, but are you trying to get the average duration for each date? The averages you have in your expected output aren't the average duration for each date, but maybe that's just filler data? – crazybilly Nov 15 '17 at 17:12
  • I want the average of the interval between `Date` and `Date - diff` from `data`. – Bas van der Bijl Nov 15 '17 at 17:22
  • Im not understanding how `Weather` comes into play here. How is the `Duration` column related to what you're trying to calculate? – acylam Nov 15 '17 at 17:42
  • I just edited the question, but the `duration` is the number of hours of rain on that day. So I want to calculate the average number of hours of rain between the `data$Date` and `data$Date - data$diff`. – Bas van der Bijl Nov 15 '17 at 18:12
  • This sounds similar to [my question](https://stackoverflow.com/q/7188807/23118), does any of the answers help? – hlovdal Nov 17 '17 at 18:29

2 Answers2

0

Here's a dplyr solution:

library(dplyr)

# add row number as a new column just to make it easier to read
weather_with_rows  <- Weather %>%
    mutate(Rownum = row_number())

# write function to filter by row number, then return the average duration
getavgduration  <- function(mydate, mydiff) {

    myrow = weather_with_rows %>%
         filter(Date == mydate) %>%
         pluck("Rownum")

    mystartrow = myrow -mydiff

    myduration = weather_with_rows %>%
        filter(
              Rownum <= myrow
            , Rownum >= mystartrow
        )

    mean(myduration$Duration)

}

# get the average duration for each Date/diff pair
averages  <- data %>%
    group_by(Date, Diff) %>%
    summarize(Average = getavgduration(Date, Diff)) %>%
    ungroup()


# join this back into the original data frame
#    this step might not be necessary 
#    and might be a big drag on performance, 
#    depending on the size of your real data
data_with_avg_duration  <- data %>%
    left_join(averages, by = c('Date','Diff')
crazybilly
  • 2,992
  • 1
  • 16
  • 42
  • Than you for your solution. If I run this code I recieve an error at: `averages <- data %>% group_by(Date, diff) %>% summarize(Average = getavgduration(Date, diff)) %>% ungroup()` The error says: `Error in summarise_impl(.data, dots) : subscript out of bounds` Do you know how to solve this? – Bas van der Bijl Nov 15 '17 at 20:08
  • I might have got the column names wrong, with regard to capitalization (or the duration column in the Weather df), inside the `getavgduration()` function. Looks like I capitalized "Date" in both data frames and did not capitalize "duration". That error implies that a named column is not found and since R is case-sensitive, I might be missing something. – crazybilly Nov 15 '17 at 21:13
  • I edited my answer to make all the column names capitalized, consistently across both data frames. – crazybilly Nov 15 '17 at 21:22
0

This old question does not have an accepted answer yet, so I feel obliged to post an alternative solution which aggregates in a non-equi join.

The OP has requested to compute the average duration of rain from a table Rain.duration of daily hours of rain fall for each date interval given in data.

library(data.table)
# make sure Date columns are of class Date
setDT(data)[, Date := as.Date(Date)]
setDT(Rain.duration)[, Date := as.Date(Date)]
# aggregate in a non-equi join and assign the result to a new column
data[,  Average := Rain.duration[data[, .(upper = Date, lower = Date - diff)], 
            on = .(Date <= upper, Date >= lower), 
            mean(Duration), by  = .EACHI]$V1][]
         Date Weight diff Loc.nr  Average
1: 2013-01-24   1040    7      2 1.962500
2: 2013-01-31   1000    7      2 2.975000
3: 2013-01-19    500    4      9 2.160000
4: 2013-01-23   1040    4      9 1.280000
5: 2013-01-28    415    5      9 2.983333
6: 2013-01-31    650    3      9 2.725000

The key part is

Rain.duration[data[, .(upper = Date, lower = Date - diff)], 
              on = .(Date <= upper, Date >= lower), 
              mean(Duration), by  = .EACHI]
         Date       Date       V1
1: 2013-01-24 2013-01-17 1.962500
2: 2013-01-31 2013-01-24 2.975000
3: 2013-01-19 2013-01-15 2.160000
4: 2013-01-23 2013-01-19 1.280000
5: 2013-01-28 2013-01-23 2.983333
6: 2013-01-28 2013-01-23 2.983333
7: 2013-01-31 2013-01-28 2.725000

which does a non-equi join with the date ranges derived from data:

data[, .(upper = Date, lower = Date - diff)]
        upper      lower
1: 2013-01-24 2013-01-17
2: 2013-01-31 2013-01-24
3: 2013-01-19 2013-01-15
4: 2013-01-23 2013-01-19
5: 2013-01-28 2013-01-23
6: 2013-01-28 2013-01-23
7: 2013-01-31 2013-01-28

by = .EACHI requests to compute the aggregate mean(Duration) for each date interval on-the-fly which avoids to create and copy temporay subsets.

Note that this solution will give correct answers even if Rain.duration has gaps or is unordered as it relies only on Date as opposed to the other solutions which use row numbers.

Uwe
  • 41,420
  • 11
  • 90
  • 134