1

I'm trying to duplicate what you'd do with AVERAGEIFS function in Excel on my dataset:

EG_df <- data.frame(id = c("red_blue", "white_blue", "red_yellow","white_yellow", "brown_blue", "brown_yellow"), 
                    StartDate = as.Date(c('2019-1-1','2019-3-1','2019-7-1','2018-1-1','2018-3-1','2018-7-1')),
                    EndDate = as.Date(c('2019-6-1','2019-12-1','2019-8-1','2018-1-1','2018-3-1','2018-7-1')),
                    avg_Value = NA
                    )

source <- data.frame(source.id = c("red_blue", "red_blue", "red_blue","brown_yellow", "brown_yellow", "brown_yellow"),
                      source.Date = as.Date(c('2019-1-1','2019-2-1','2019-3-1','2018-7-1','2018-8-1','2018-9-1')),
                     source.Value = c(22,56,32,31,14,7)
                    )

Logic I need to fill in EG.df$avg_Value :

For each row in EG_df, return the average value of source.value when source.Date is between StartDate and EndDate.

The Excel Formula, for clarification:

=AVERAGEIFS(source.value, source.id, id, source.Date, ">="&StartDate, source.Date, ">="&EndDate)

Any help would be greatly appreciated!

Nick
  • 667
  • 7
  • 19
Mirko Gagliardi
  • 69
  • 1
  • 10

4 Answers4

1

You can do this pretty efficiently with a non-equi join:

library(data.table)
setDT(source); setDT(EG_df)

EG_df[, avg_Value := 
  source[copy(.SD), on=.(source.id = id, source.Date >= StartDate, source.Date <= EndDate), mean(x.source.Value), by=.EACHI]$V1
]

             id  StartDate    EndDate avg_Value
1:     red_blue 2019-01-01 2019-06-01  36.66667
2:   white_blue 2019-03-01 2019-12-01        NA
3:   red_yellow 2019-07-01 2019-08-01        NA
4: white_yellow 2018-01-01 2018-01-01        NA
5:   brown_blue 2018-03-01 2018-03-01        NA
6: brown_yellow 2018-07-01 2018-07-01  31.00000

(There are NAs since I'm just using the excerpt source provided rather than the full table.)

How it works

x[i, j] subsets using i and then evaluates j, inside of which .SD refers to the Subset of Data.

When x and i are both tables, x[i, on=, j, by=.EACHI] is a join, with on= specifying the join conditions, and j evaluated for each row of i.

Because j = mean(x.source.Value) returns an unnamed column, it gets the default name of V1.

Inside j of x[i, j], v := val creates or modifies column v by assigning val to it.

Frank
  • 66,179
  • 8
  • 96
  • 180
1

Using the dplyr Librarie

library(dyplr)

df = EG_df %>% 
     left_join(source, by = c('id' = 'source.id')) %>% 
     filter((StartDate <= source.Date) & (source.Date <= EndDate)) %>% 
     group_by(id, StartDate, EndDate) %>% 
     summarise(value = mean(source.Value))
Wil
  • 3,076
  • 2
  • 12
  • 31
MathLal
  • 382
  • 3
  • 12
0

using the tidyverse

dplyr::inner_join(source,EG_df,by = c("source.id"="id")) %>%
  dplyr::filter(source.Date >= StartDate,
                source.Date <= EndDate) %>%
  dplyr::group_by(source.id,StartDate,EndDate) %>%
  dplyr::summarise(avg_Value = mean(source.Value))
Wil
  • 3,076
  • 2
  • 12
  • 31
  • If you're going to write an answer all with one library, I'd recommend loading the package `library(dplyr)` at the top of your code, and removing `dplyr::` since the use of `dplyr` is implicit. Typically the `package::function` notation is useful if you're using multiple libraries, and want to add clarity to where the function comes from. – Mako212 Mar 20 '19 at 16:57
  • that's a fair point. I generally take the approach that I don't know what is occurring elsewhere in their code, so I am explicit to prevent the namespace collision. – Wil Mar 20 '19 at 16:59
  • True, although in the context of this website, OP should be providing a [mcve], which should give us the context to anticipate any namespace collision, without having to preemptively account for it. – Mako212 Mar 20 '19 at 17:02
  • Hi @Wil , thank you! My 'EG_df' has many more columns than the few in the example, and when selecting some of those to be added in 'summarise()' it does not accept one of those, returning the error: 'Error: Column anotherID is unknown' _Could it be because_ 'anotherID' _is present in both_ 'EG_df' _and_ 'source' ? If that's the case, how should I properly indicate the 'EG_df$anotherID' column? (I tried with the dollar sign and that's what I get: 'Error: Column anotherID must be length 34260 (the number of rows) or one, not 23160' Thank you in advance for any further help! – Mirko Gagliardi Mar 21 '19 at 10:51
  • What is the role of anotherID in your data? Is it a unique identifier, a measurement of some sort that needs to be aggregated, or just another attribute of the data that you would like included in the output? This [dplyr article](https://dplyr.tidyverse.org/articles/two-table.html) may be helpful for you as well. – Wil Mar 21 '19 at 11:38
0

Consider the base package running a merge > subset > aggregate for the averages by id group and date range. Then merge this resultset back to original dataset.

# MERGE > SUBSET > AGGREGATE
agg_df <- aggregate(cbind(avgValue=source.Value) ~ id + StartDate + EndDate,
                    subset(merge(EG_df, source, by.x="id", by.y="source.id", all.x=TRUE),
                           source.Date >= StartDate & source.Date <= EndDate),
                    FUN=mean)

# MERGE WITH ORIGINAL DATASET
merge(EG_df, agg_df, by=c("id", "StartDate", "EndDate"), all.x=TRUE)

#             id  StartDate    EndDate avgValue
# 1   brown_blue 2018-03-01 2018-03-01       NA
# 2 brown_yellow 2018-07-01 2018-07-01 31.00000
# 3     red_blue 2019-01-01 2019-06-01 36.66667
# 4   red_yellow 2019-07-01 2019-08-01       NA
# 5   white_blue 2019-03-01 2019-12-01       NA
# 6 white_yellow 2018-01-01 2018-01-01       NA

Rextester Demo


Aside - This is similar to SQL's greatest-n-per-group problem (official StackOverflow tag) where the agg_df would be a subquery or CTE joined back to original table.

Parfait
  • 104,375
  • 17
  • 94
  • 125