
I have two data frames.

I would like to obtain the average of the Speed variable (from the data1 data frame) for the 3 hours preceding each date in my second data frame (data2).

For example, I would like the mean of Speed between 2021-02-15 01:00:00 and the date found in my data frame 2, 2021-02-15 04:00:00 (i.e. 3 hours earlier); another example is between 2021-02-15 07:00:00 GMT and 2021-02-15 10:00:00...

Here are my two data frames:

Speed <- c(7,9,9,4,2,5,2,NA,1,14,3,4,10,9,34,14,25)

Date <- c("2021-02-15 01:00:00 GMT","2021-02-15 02:00:00 GMT","2021-02-15 03:00:00 GMT","2021-02-15 04:00:00 GMT", "2021-02-15 05:00:00 GMT" ,"2021-02-15 06:00:00 GMT","2021-02-15 07:00:00 GMT","2021-02-15 08:00:00 GMT", "2021-02-15 09:00:00 GMT","2021-02-15 10:00:00 GMT","2021-02-15 11:00:00 GMT" ,"2021-02-15 12:00:00 GMT","2021-02-15 13:00:00 GMT","2021-02-15 14:00:00 GMT", "2021-02-15 15:00:00 GMT","2021-02-15 16:00:00 GMT", "2021-02-15 17:00:00 GMT")

Date <- as.POSIXlt(Date, tz = "GMT", format = "%Y-%m-%d %H:%M:%S")

data1 <- data.frame(Speed, Date)


search <- as.Date(c("2021-02-15 04:00:00","2021-02-15 10:00:00","2021-02-15 17:00:00"))
search <- as.POSIXlt(search, tz = "GMT", format = "%Y-%m-%d %H:%M:%S")

data2 <- data.frame(search)

How can I do this?

I know lubridate is a good library for working with dates and I would like to find a solution using it, but I couldn't figure out how.

My expected output will be:

Dates                   average Speed
2021-02-15 04:00:00     average speed between this date and -3hours
2021-02-15 10:00:00     average speed between this date and -3hours
2021-02-15 17:00:00     average speed between this date and -3hours

I have a big dataset.

  • You included the [tag:datatable] tag which is ambiguous, often referring to a .NET or javascript library. I suspect you may have intended R's `data.frame`, though you already listed [tag:dataframe]; some use it instead of [tag:data.table] for R's `data.table` package, but your code does not suggest that package, so I'm going to assume you're just looking for an R solution to something sitting in one or more `data.frame` objects. Is that correct? – r2evans Jul 11 '23 at 12:52

1 Answer


This can be solved with a combination of a non-equi (range) join and a grouped summary.

Unfortunately, as posted your data2 dates all occur before all samples in data1 (the as.Date call drops the time-of-day, so all three collapse to the same midnight value). I'll use my own data2a to demonstrate.

data2a <- data.frame(search = data1$Date[c(2, 5, 9, 17)])
data2a
#                search
# 1 2021-02-15 02:00:00
# 2 2021-02-15 05:00:00
# 3 2021-02-15 09:00:00
# 4 2021-02-15 17:00:00

I'll use dplyr to demonstrate the non-equi join and summarization:

library(dplyr) # dplyr_1.1 or newer
data2a %>%
  mutate(id = row_number(), search_m3 = search - 3600*3) %>%
  left_join(data1, join_by(between(y$Date, x$search_m3, x$search))) %>%
  summarize(avg = mean(Speed, na.rm = TRUE), .by = c(id, search))
#   id              search       avg
# 1  1 2021-02-15 02:00:00  8.000000
# 2  2 2021-02-15 05:00:00  6.000000
# 3  3 2021-02-15 09:00:00  2.666667
# 4  4 2021-02-15 17:00:00 20.500000

Validation:

  • 2021-02-15 02:00:00 (including from 23:00 the night prior) contains just two values, 7 and 9, average is 8;
  • 2021-02-15 05:00:00 (from 02:00:00) includes four values (because this is 3 hours inclusive): 9, 9, 4, and 2, the average is 6;
  • 2021-02-15 09:00:00 includes four values: 5, 2, NA, and 1, mean(c(5,2,NA,1), na.rm=TRUE) is 2.67; and
  • 2021-02-15 17:00:00 includes four values: 9, 34, 14, 25, mean(c(9, 34, 14, 25)) is 20.5.
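
The same windows can also be checked (or, for small data, computed outright) with plain base R subsetting. A minimal sketch; `avg3h` is a helper name of my own, and the window is closed on both ends to match `between()` above:

```r
Speed <- c(7,9,9,4,2,5,2,NA,1,14,3,4,10,9,34,14,25)
Date  <- as.POSIXct(sprintf("2021-02-15 %02d:00:00", 1:17), tz = "GMT")

# Mean Speed in the closed window [end - 3 hours, end]
avg3h <- function(end) mean(Speed[Date >= end - 3*3600 & Date <= end], na.rm = TRUE)

ends <- Date[c(2, 5, 9, 17)]
res <- vapply(seq_along(ends), function(i) avg3h(ends[i]), numeric(1))
res
# 8.000000 6.000000 2.666667 20.500000
```

This loops over the search dates one at a time, so it is fine for validation but will be slower than a real join on a big dataset.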

Walk-through:

  • I define id so that we have an unambiguous way to return to the individual rows of data2a. In this case we could have used search as the key, but I've had non-equi-join situations where there are legitimate duplicates in one of the joined values, so I tend to use an explicit id to recover the correct rows and order;
  • search_m3 is "search minus 3 (hours)"; most (all?) non-equi join mechanisms in R need both ends of the range as explicit columns;
  • left_join is one of several merge/join verbs (see How to join (merge) data frames (inner, outer, left, right), What's the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN and FULL JOIN?, (data.table) Left join using data.table), resulting in at least nrow(data2a) rows, and likely (in this case certainly) more due to the one-to-many expansion;
  • join_by (added in dplyr 1.1.0) lets us define the join conditions; here I'm using between(..., search_m3, search), though one could also use join_by(search_m3 < Date, search >= Date) for a left-open window (so that 5am includes data after 2am but not 2am itself);
  • summarize with the .by= grouping argument returns just the .by variables and the one calculated field.

The summarize-by-group is easily done in all dialects of R (base, dplyr, and data.table), but the non-equi join is not done "easily" in base R; it can also be done fairly directly in dplyr (shown above), data.table, or with the fuzzyjoin package or anything that supports SQL (sqldf, RSQLite, and any "more-formal" database connector).
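
As one concrete alternative, here is how the same inclusive-window join could look in data.table. A sketch, rebuilding the data from scratch so it is self-contained (the join-column names in the result may need tidying, but `avg` is what we want):

```r
library(data.table)

Speed <- c(7,9,9,4,2,5,2,NA,1,14,3,4,10,9,34,14,25)
Date  <- as.POSIXct(sprintf("2021-02-15 %02d:00:00", 1:17), tz = "GMT")
dt1 <- data.table(Speed, Date)

dt2 <- data.table(search = Date[c(2, 5, 9, 17)])
dt2[, search_m3 := search - 3*3600]

# Non-equi join: for each row of dt2 (by = .EACHI), average the dt1 rows
# whose Date falls in the closed window [search_m3, search]
res <- dt1[dt2, on = .(Date >= search_m3, Date <= search),
           .(avg = mean(Speed, na.rm = TRUE)), by = .EACHI]
res$avg
# 8.000000 6.000000 2.666667 20.500000
```

`by = .EACHI` groups the aggregation per row of `dt2`, which plays the role of the `id`/`summarize(.by=)` pair in the dplyr version.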


Using the Dates-containing frame you added, I'll modify this code in two ways:

  • Dates instead of search (though I still use search_m3 as my temporary variable); and
  • left-open comparison, so when looking at (say) 10:00:00, we include times 08:00:00, 09:00:00, and 10:00:00, but notably not 07:00:00; feel free to play with >/>=/</<= for your preferences of left/right open/closed-ness.

data2b <- structure(list(Dates = structure(c(1613361600, 1613383200, 1613408400), class = c("POSIXct", "POSIXt"), tzone = "GMT")), class = "data.frame", row.names = c(NA, -3L))
data2b %>%
  mutate(id = row_number(), search_m3 = Dates - 3600*3) %>%
  left_join(data1, join_by(search_m3 < Date, Dates >= Date)) %>%
  summarize(avg = mean(Speed, na.rm = TRUE), .by = c(id, Dates))
#   id               Dates       avg
# 1  1 2021-02-15 04:00:00  7.333333
# 2  2 2021-02-15 10:00:00  7.500000
# 3  3 2021-02-15 17:00:00 24.333333
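
Since the question mentions lubridate: it drops in naturally for the offset arithmetic, while the join itself stays in dplyr. A sketch, rebuilding data1 and data2b so the chunk is self-contained:

```r
library(dplyr)      # dplyr_1.1 or newer, for join_by()
library(lubridate)  # for hours()

Speed <- c(7,9,9,4,2,5,2,NA,1,14,3,4,10,9,34,14,25)
data1 <- data.frame(
  Speed,
  Date = as.POSIXct(sprintf("2021-02-15 %02d:00:00", 1:17), tz = "GMT"))
data2b <- data.frame(Dates = as.POSIXct(
  c("2021-02-15 04:00:00", "2021-02-15 10:00:00", "2021-02-15 17:00:00"),
  tz = "GMT"))

res <- data2b %>%
  mutate(id = row_number(), search_m3 = Dates - hours(3)) %>%  # lubridate period
  left_join(data1, join_by(search_m3 < Date, Dates >= Date)) %>%
  summarize(avg = mean(Speed, na.rm = TRUE), .by = c(id, Dates))
res$avg
# 7.333333 7.500000 24.333333
```

The only lubridate-specific piece is `Dates - hours(3)` in place of `Dates - 3600*3`; the results match the left-open output above.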
r2evans