0

I'm resorting to R to handling averageif() and maxif() like functions because my data set is too large and Excel keeps crashing.

I'm looking for a way to find the average wind based on status, and between columns Start Date and End Date from df. I'd imagine this would involve the between function in R.

status<-c("tropical depression", "tropical depression", "tropical storm")
Begin_Date <- as.Date(c("1974/06/01","1980/06/05","1990/06/07"))
End_Date <- as.Date(c("1975/06/01","1981/07/05","1991/08/07"))
df<-as.data.frame(cbind(status,Begin_Date,End_Date))
df$Begin_Date<-as.Date(Begin_Date)
df$End_Date<-as.Date(End_Date)
df$status<-as.character(status)
storms$date <- as.Date(with(storms, paste(year, month, day,sep="-")), "%Y-%m-%d")

(Note: my actual problem is billing data, so I would need a solution to gather an average based on IDs, between a start billing date and end billing date).

Take the storms dataset from dplyr as an example.

starting with my df below:

               status Begin_Date   End_Date
tropical depression  1974-06-01 1975-06-01
 tropical depression 1980-06-05 1981-07-05
      tropical storm 1990-06-07 1991-08-07 

What I'd like:

               status Begin_Date   End_Date   Avg Wind
tropical depression  1974-06-01 1975-06-01    44.3
 tropical depression 1980-06-05 1981-07-05    66.7
      tropical storm 1990-06-07 1991-08-07    56

I'm trying to get this to be dplyr compliant as well.

My attempt is erroring out:

df$Average_wind < - storms %>% filter (status = df$status, date >= df$Begin_Date,date<df$End_Date) %>% summarise(avg = mean(wind))

Error in FUN(left) : invalid argument to unary operator

The average wind and max wind values in the "I'd like" example are not accurate, simply for formatting purposes.

longlivebrew
  • 301
  • 3
  • 16
  • 1
    Something like this: `library(dplyr); storms %>% mutate(year.range = cut(year, breaks=seq(min(year), max(year) + 5, 5), include.lowest=TRUE)) %>% group_by(status, year.range) %>% summarise(average.wind=mean(wind, na.rm=TRUE), max.wind=max(wind, na.rm=TRUE))` – eipi10 Jan 23 '18 at 21:37
  • I'll reclarify, my actual data is daily and my intervals are billing periods (which aren't usually very pretty) - so I'm hoping to use the between function. – longlivebrew Jan 23 '18 at 21:40
  • Just `left_join(storms, your_data)` (on `status`) and filter out the rows where `year` isn't in your range. AFAIK, that's the only way to an interval join with `dplyr`. If you're open to other tools, `data.table` supports non-equi joins. – Gregor Thomas Jan 23 '18 at 22:50
  • Gregor do you know one would go about this with a between function? or if it's needed? – longlivebrew Jan 23 '18 at 22:58
  • @Gregor I ended up revising the question to provide reproducible data - please let me know if you have any suggestions - thanks a ton – longlivebrew Jan 23 '18 at 23:34

3 Answers3

2

Ok - I am posting as a new answer because you are now specifying you want dplyr. This may be easier if you don't convert to date - and just make a numerical string

x <- storms
x$date <- as.Date(with(storms, paste(year, month, day,sep="-")), "%Y-%m-%d")

    # with filter
    x %>% 
    filter( date  > as.Date("1975-06-01") & date < as.Date("1976-06-01") ) %>% 
    group_by(  status ) %>%
    summarise(Avg.Win=mean(wind, na.rm=TRUE))

    #with mutuate
    x %>% 
    mutate( times = cut( date , breaks= c( as.Date("1975-06-01")  , as.Date("1976-06-01"), as.Date("1978-06-01")) ) ) %>% 
    group_by( times, status ) %>%
    summarise(Avg.Win=mean(wind, na.rm=TRUE))
MatthewR
  • 2,660
  • 5
  • 26
  • 37
2

Exactly as described in comments: Just left_join(storms, your_data) (on status) and filter out the rows where year isn't in your range.

If you're open to other tools, data.table supports non-equi joins, which will be significantly more efficient on large data.

    left_join(storms, df, by = "status") %>%
        filter(Begin_Date <= date & date <= End_Date) %>%
        group_by(Begin_Date, End_Date, status) %>%
        summarize(avg_wind = mean(wind))
    # # A tibble: 2 x 4
    # # Groups: Begin_Date, End_Date [?]
    #   Begin_Date End_Date   status              avg_wind
    #   <date>     <date>     <chr>                  <dbl>
    # 1 1980-06-05 1981-07-05 tropical depression     26.9
    # 2 1990-06-07 1991-08-07 tropical storm          45.4

There are only 2 rows in the results because apparently there are no tropical depressions in the storms data between 1974-06-01 and 1975-06-01. In fact, the smallest date in storms is 1975-06-27.

You seem very keen on using between. If you want to, you can use it inside filter() instead of my code. It won't change the results.

Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
  • Gregor this is absolutely perfect! Thank you for the help - this is probably my third or so question you've got the answer tag! (feel free to keep it up, I just posted another ;) :P) – longlivebrew Jan 29 '18 at 22:12
  • Also - I was just in MI a few months ago - checked out Masonic Mansion in downtown Detroit, but was in Traverse City this summer - absolutely beautiful – longlivebrew Jan 29 '18 at 22:13
1

This is one of those things there are a lot of ways to do. here are some base r options

# Using Indexing
x <- data.frame( storms )
x$wind <- as.numeric( x$wind ) 
mean(  x[ x$year %in% 1979:1980 & x$status %in% "hurricane"  , "wind" ]  , na.rm=T )
max(  x[ x$year %in% 1979:1980 & x$status %in% "hurricane"  , "wind" ]  , na.rm=T )

# using aggregate
x$groups <- cut( x$year , c(-Inf , 1979, 1981 , 1985 , Inf ))
x$groups_type <- paste( x$groups , x$status)
aggregate ( x$wind,by= list(x$groups_type) , mean, na.rm=T)
aggregate ( x$wind,by= list(x$groups_type) , max, na.rm=T)
MatthewR
  • 2,660
  • 5
  • 26
  • 37
  • Thank you for the help! I'm looking for a way to specifically look between the start year column and end year column and use a between function. This example is a much more simple example of my actual problem. In my actual problem, I'm trying to get average readings for billing months, and need to get a function that matches an ID, and averages for between a start date and end date. Any ideas? :P – longlivebrew Jan 23 '18 at 22:09
  • the colon in r indicates between look at 5:10 or as.Date( "2016-10-10" ):as.Date("2017-10-10") – MatthewR Jan 23 '18 at 23:26
  • I revised the question! – longlivebrew Jan 23 '18 at 23:32