it's my first time posting here! I am super stuck on what I'm sure is an easy thing to do.
I have a dataframe of irregular intervals and a vector of dates. If one of the dates occurs within any of the given intervals, I would like a new column to flag this (as the intervals need to be deleted). Similar to this post but the solution doesn't work due to the irregular intervals. I have over 2000 intervals and 2000 dates.
I can get the dates that occur within the given intervals using the %within% function, but this is no good as I cant find which intervals the dates are occurring in.
I've tried the solutions in this similar post but I have no grouping variable and can't get them to work.
Any suggestions would be incredibly helpful!!! Thank you so much!!
Example raw data (not as lubridate intervals):
>df1
diveno start fin
1 1 2018-08-01 08:20:40 2018-08-01 08:39:20
2 2 2018-08-01 08:40:50 2018-08-01 08:53:40
3 3 2018-08-01 10:01:00 2018-08-01 10:16:30
4 4 2018-08-01 15:45:30 2018-08-01 15:58:20
5 5 2018-08-01 17:06:00 2018-08-01 17:18:20
>df2
date
1 2018-08-01 08:30:00
2 2018-08-01 15:47:00
3 2018-08-02 17:10:00
What I'd like
> df3
diveno start fin dateoccurs
1 1 2018-08-01 08:20:40 2018-08-01 08:39:20 Y
2 2 2018-08-01 08:40:50 2018-08-01 08:53:40 N
3 3 2018-08-01 10:01:00 2018-08-01 10:16:30 N
4 4 2018-08-01 15:45:30 2018-08-01 15:58:20 Y
5 5 2018-08-01 17:06:00 2018-08-01 17:18:20 N
Where the dateoccurs column flags if a date from df2 occurs in given intervals in df1
Code for example data:
df1<-data.frame(diveno=c(1,2,3,4,5),
start=c("2018-08-01 08:20:40","2018-08-01 08:40:50", "2018-08-01 10:01:00","2018-08-01 15:45:30","2018-08-01 17:06:00"),
fin=c("2018-08-01 08:39:20","2018-08-01 08:53:40","2018-08-01 10:16:30","2018-08-01 15:58:20", "2018-08-01 17:18:20"))
df1$start <- as.POSIXct(df1$start,format="%Y-%m-%d %H:%M:%S",tz="CET")
df1$fin <- as.POSIXct(df1$fin,format="%Y-%m-%d %H:%M:%S",tz="CET")
df2<-data.frame(date=c("2018-08-01 08:30:00", "2018-08-01 15:47:00", "2018-08-02 17:10:00"))
df2$date <- as.POSIXct(df2$date,format="%Y-%m-%d %H:%M:%S",tz="CET")
What I need:
df3<-data.frame(diveno=c(1,2,3,4,5),
start=c("2018-08-01 08:20:40","2018-08-01 08:40:50", "2018-08-01 10:01:00","2018-08-01 15:45:30","2018-08-01 17:06:00"),
fin=c("2018-08-01 08:39:20","2018-08-01 08:53:40","2018-08-01 10:16:30","2018-08-01 15:58:20", "2018-08-01 17:18:20"),
dateoccurs=c("Y","N","N","Y","N"))
The closest I've gotten is using an answer from this post But it returns altered 'fin' times, and when applied to the real massive dataset seems to duplicate values and change the number of 'diveno'!
intervals<-df1
elements<-df2[,1]
library(data.table) #v1.10.0
j<-setDT(intervals)[data.table(elements), on = .(start <= elements, fin >= elements)]
j2<-as.data.frame(j)
na.omit(j2)
UPDATED sample data for df2 that seems to produce false positives?
> dput(df2) structure(list(date = structure(c(1533096000, 1533096300, 1533096600, 1533096900, 1533097200, 1533097500, 1533097800, 1533098100, 1533098400, 1533098700, 1533099000, 1533099300, 1533099600, 1533099900, 1533100200, 1533100500, 1533100800, 1533101100, 1533101400, 1533101700, 1533102000, 1533102300, 1533102600, 1533102900, 1533103200, 1533103500, 1533103800, 1533104100, 1533104400, 1533104700, 1533105000, 1533105300, 1533105600, 1533105900, 1533106200, 1533106500, 1533106800, 1533107100, 1533107400, 1533107700, 1533108000, 1533108300, 1533108600, 1533108900, 1533109200, 1533109500, 1533109800, 1533110100), tzone = "UTC", class = c("POSIXct", "POSIXt")), depth = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_)), class = "data.frame", row.names = c(NA,
-48L))
Update 2 (sorry!) within df2 the following times:
30 2018-08-01 06:25:00 NA
31 2018-08-01 06:30:00 NA
32 2018-08-01 06:35:00 NA
Seem to be recognised as falling within the following intervals:
diveno start fin dateoccurs
1 1 2018-08-01 08:20:40 2018-08-01 08:39:20 Y
2 2 2018-08-01 08:40:50 2018-08-01 08:53:40 Y
3 3 2018-08-01 10:01:00 2018-08-01 10:16:30 N
Why might this be happening?
s<-df1[1,2] f<-df1[1,3] int<-interval(s,f)
df2[,1] %within% ir