
It's my first time posting here! I am super stuck on what I'm sure is an easy thing to do.

I have a dataframe of irregular intervals and a vector of dates. If one of the dates occurs within any of the given intervals, I would like a new column to flag this (as those intervals need to be deleted). This is similar to this post, but that solution doesn't work due to the irregular intervals. I have over 2000 intervals and 2000 dates.

I can get the dates that occur within the given intervals using the %within% function, but this is no good as I can't find which intervals the dates occur in.

I've tried the solutions in this similar post but I have no grouping variable and can't get them to work.

Any suggestions would be incredibly helpful!!! Thank you so much!!

Example raw data (not as lubridate intervals):

>df1
 diveno               start                 fin
1      1 2018-08-01 08:20:40 2018-08-01 08:39:20
2      2 2018-08-01 08:40:50 2018-08-01 08:53:40
3      3 2018-08-01 10:01:00 2018-08-01 10:16:30
4      4 2018-08-01 15:45:30 2018-08-01 15:58:20
5      5 2018-08-01 17:06:00 2018-08-01 17:18:20

>df2
                 date
1 2018-08-01 08:30:00
2 2018-08-01 15:47:00
3 2018-08-02 17:10:00

What I'd like:
> df3
  diveno               start                 fin dateoccurs
1      1 2018-08-01 08:20:40 2018-08-01 08:39:20          Y
2      2 2018-08-01 08:40:50 2018-08-01 08:53:40          N
3      3 2018-08-01 10:01:00 2018-08-01 10:16:30          N
4      4 2018-08-01 15:45:30 2018-08-01 15:58:20          Y
5      5 2018-08-01 17:06:00 2018-08-01 17:18:20          N

Here the dateoccurs column flags whether a date from df2 falls within the given interval in df1.

Code for example data:


df1<-data.frame(diveno=c(1,2,3,4,5), 
           start=c("2018-08-01 08:20:40","2018-08-01 08:40:50", "2018-08-01 10:01:00","2018-08-01 15:45:30","2018-08-01 17:06:00"),
           fin=c("2018-08-01 08:39:20","2018-08-01 08:53:40","2018-08-01 10:16:30","2018-08-01 15:58:20", "2018-08-01 17:18:20"))

df1$start <- as.POSIXct(df1$start,format="%Y-%m-%d %H:%M:%S",tz="CET")
df1$fin <- as.POSIXct(df1$fin,format="%Y-%m-%d %H:%M:%S",tz="CET")


df2<-data.frame(date=c("2018-08-01 08:30:00", "2018-08-01 15:47:00", "2018-08-02 17:10:00"))
df2$date <- as.POSIXct(df2$date,format="%Y-%m-%d %H:%M:%S",tz="CET")

What I need:


df3<-data.frame(diveno=c(1,2,3,4,5), 
                start=c("2018-08-01 08:20:40","2018-08-01 08:40:50", "2018-08-01 10:01:00","2018-08-01 15:45:30","2018-08-01 17:06:00"),
                fin=c("2018-08-01 08:39:20","2018-08-01 08:53:40","2018-08-01 10:16:30","2018-08-01 15:58:20", "2018-08-01 17:18:20"),
                dateoccurs=c("Y","N","N","Y","N"))
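For reference, here is a plain lubridate sketch of the flagging I'm after, looping over the intervals with sapply. This is not code from any of the linked posts, and it will likely be slow at 2000 x 2000 rows, but it makes the intent explicit:

```r
library(lubridate)

df1 <- data.frame(diveno = 1:5,
                  start = as.POSIXct(c("2018-08-01 08:20:40", "2018-08-01 08:40:50",
                                       "2018-08-01 10:01:00", "2018-08-01 15:45:30",
                                       "2018-08-01 17:06:00"), tz = "CET"),
                  fin   = as.POSIXct(c("2018-08-01 08:39:20", "2018-08-01 08:53:40",
                                       "2018-08-01 10:16:30", "2018-08-01 15:58:20",
                                       "2018-08-01 17:18:20"), tz = "CET"))
df2 <- data.frame(date = as.POSIXct(c("2018-08-01 08:30:00", "2018-08-01 15:47:00",
                                      "2018-08-02 17:10:00"), tz = "CET"))

# for each dive, flag "Y" if any date in df2 falls inside [start, fin]
df1$dateoccurs <- ifelse(
  sapply(seq_len(nrow(df1)),
         function(i) any(df2$date %within% interval(df1$start[i], df1$fin[i]))),
  "Y", "N")
df1$dateoccurs
# [1] "Y" "N" "N" "Y" "N"
```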

The closest I've gotten is using an answer from this post, but it returns altered 'fin' times, and when applied to the real, much larger dataset it seems to duplicate values and change the number of 'diveno' rows!

intervals<-df1
elements<-df2[,1]

library(data.table) #v1.10.0
j<-setDT(intervals)[data.table(elements), on = .(start <= elements, fin >= elements)]
j2<-as.data.frame(j)
na.omit(j2)
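A note on why the 'fin' times look altered: in a data.table non-equi join, the join columns in the result take their values from the i table (here, the dates), and each matching date produces its own row, which would explain both the changed times and the duplicated 'diveno' values. A minimal sketch with made-up two-row data (not the real dataset) that keeps copies of the original bounds:

```r
library(data.table)

intervals <- data.table(diveno = 1:2,
                        start = as.POSIXct(c("2018-08-01 08:20:40",
                                             "2018-08-01 08:40:50"), tz = "CET"),
                        fin   = as.POSIXct(c("2018-08-01 08:39:20",
                                             "2018-08-01 08:53:40"), tz = "CET"))
dates <- data.table(date = as.POSIXct(c("2018-08-01 08:30:00",
                                        "2018-08-01 08:31:00"), tz = "CET"))

# keep copies of the join columns so the original bounds survive the join
intervals[, `:=`(start0 = start, fin0 = fin)]
j <- intervals[dates, on = .(start <= date, fin >= date)]
print(j)
# 'start' and 'fin' now both hold the matched date; start0/fin0 are the
# real bounds, and diveno 1 appears twice because two dates matched it
```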

Updated sample data for df2 that seems to produce false positives:

> dput(df2)
structure(list(date = structure(c(1533096000, 1533096300, 1533096600,
1533096900, 1533097200, 1533097500, 1533097800, 1533098100, 1533098400,
1533098700, 1533099000, 1533099300, 1533099600, 1533099900, 1533100200,
1533100500, 1533100800, 1533101100, 1533101400, 1533101700, 1533102000,
1533102300, 1533102600, 1533102900, 1533103200, 1533103500, 1533103800,
1533104100, 1533104400, 1533104700, 1533105000, 1533105300, 1533105600,
1533105900, 1533106200, 1533106500, 1533106800, 1533107100, 1533107400,
1533107700, 1533108000, 1533108300, 1533108600, 1533108900, 1533109200,
1533109500, 1533109800, 1533110100), tzone = "UTC", class = c("POSIXct",
"POSIXt")), depth = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_)),
class = "data.frame", row.names = c(NA, -48L))

Update 2 (sorry!): within df2, the following times:

30 2018-08-01 06:25:00    NA
31 2018-08-01 06:30:00    NA
32 2018-08-01 06:35:00    NA

Seem to be recognised as falling within the following intervals:

   diveno               start                 fin dateoccurs
1        1 2018-08-01 08:20:40 2018-08-01 08:39:20          Y
2        2 2018-08-01 08:40:50 2018-08-01 08:53:40          Y
3        3 2018-08-01 10:01:00 2018-08-01 10:16:30          N

Why might this be happening?

library(lubridate)
s <- df1[1, 2]
f <- df1[1, 3]
int <- interval(s, f)

df2[, 1] %within% int
Jess
2 Answers


A data.table solution using a non-equi update join:

library(data.table)
setDT(df1); setDT(df2)
# initialise new column with "N"
df1[, dateoccurs := "N"]
# update join
df1[df2, dateoccurs := "Y", on = .(start <= date, fin >= date)][]
#    diveno               start                 fin dateoccurs
# 1:      1 2018-08-01 08:20:40 2018-08-01 08:39:20          Y
# 2:      2 2018-08-01 08:40:50 2018-08-01 08:53:40          N
# 3:      3 2018-08-01 10:01:00 2018-08-01 10:16:30          N
# 4:      4 2018-08-01 15:45:30 2018-08-01 15:58:20          Y
# 5:      5 2018-08-01 17:06:00 2018-08-01 17:18:20          N
Wimpel
  • Thank you so so much!! This is PERFECT!! Just one problem though... When I run it with the real data, it's flagging diveno 1 and 2 as Y, but there are no values within the range of diveno 1 and 2. I've added a filtered subsection of the dates from df2 on 2018-08-01. Do you know why it might be recognising these? (diveno 1: 2018-08-01 08:20:40 to 08:39:20, flagged Y; diveno 2: 2018-08-01 08:40:50 to 08:53:40, flagged Y) – Jess Dec 21 '22 at 13:12
  • I guess it's a timezone issue? Compare `attributes(df1$start)` with `attributes(df2$date)`: `tzone` should match. If it doesn't, you can use `lubridate::force_tz()` to change/set a timezone without altering the timestamps. – Wimpel Dec 21 '22 at 13:40
  • YOU ARE INCREDIBLE!!!!! Honestly thank you so so much, you have no idea how much you've helped me, I've been stuck on this whole problem for ages, thank you!! – Jess Dec 21 '22 at 20:40
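Following up on the timezone comment above, here is a minimal sketch of the mismatch (assuming, as the dput suggests, that the real df2 was parsed as UTC while df1 is CET; the times below are illustrative, not the real data) and of the `force_tz()` fix:

```r
library(lubridate)

start <- as.POSIXct("2018-08-01 08:20:40", tz = "CET")
fin   <- as.POSIXct("2018-08-01 08:39:20", tz = "CET")

# a timestamp parsed as UTC: 06:30 UTC is 08:30 CET (CEST, UTC+2), so a
# comparison on the underlying instants puts it inside the dive
date_utc <- as.POSIXct("2018-08-01 06:30:00", tz = "UTC")
date_utc >= start & date_utc <= fin   # TRUE: the false positive

# force_tz() relabels the timezone without shifting the clock time
date_cet <- force_tz(date_utc, tzone = "CET")
date_cet >= start & date_cet <= fin   # FALSE: 06:30 CET is before the dive
```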

You may use outer.

fun <- function(i, j) data.table::between(df2[i, 'date'], df1[j, 'start'], df1[j, 'fin'])
df1$occ <- colSums(outer(seq_len(nrow(df2)), seq_len(nrow(df1)), Vectorize(fun)))
df1
#   diveno               start                 fin occ
# 1      1 2018-08-01 08:20:40 2018-08-01 08:39:20   1
# 2      2 2018-08-01 08:40:50 2018-08-01 08:53:40   0
# 3      3 2018-08-01 10:01:00 2018-08-01 10:16:30   0
# 4      4 2018-08-01 15:45:30 2018-08-01 15:58:20   1
# 5      5 2018-08-01 17:06:00 2018-08-01 17:18:20   0

The binary column can easily be converted to a factor if you like.

df1$occ <- colSums(outer(seq_len(nrow(df2)), seq_len(nrow(df1)), Vectorize(fun))) |> 
  factor(labels=c("N", "Y"))
df1
#   diveno               start                 fin occ
# 1      1 2018-08-01 08:20:40 2018-08-01 08:39:20   Y
# 2      2 2018-08-01 08:40:50 2018-08-01 08:53:40   N
# 3      3 2018-08-01 10:01:00 2018-08-01 10:16:30   N
# 4      4 2018-08-01 15:45:30 2018-08-01 15:58:20   Y
# 5      5 2018-08-01 17:06:00 2018-08-01 17:18:20   N

Data:

df1 <- structure(list(diveno = c(1, 2, 3, 4, 5), start = structure(c(1533104440, 
1533105650, 1533110460, 1533131130, 1533135960), class = c("POSIXct", 
"POSIXt"), tzone = "CET"), fin = structure(c(1533105560, 1533106420, 
1533111390, 1533131900, 1533136700), class = c("POSIXct", "POSIXt"
), tzone = "CET"), occ = structure(c(2L, 1L, 1L, 2L, 1L), levels = c("N", 
"Y"), class = "factor")), row.names = c(NA, -5L), class = "data.frame")

df2 <- structure(list(date = structure(c(1533105000, 1533131220, 1533222600
), class = c("POSIXct", "POSIXt"), tzone = "CET")), row.names = c(NA, 
-3L), class = "data.frame")
jay.sf
    This is wonderful, thank you so much!! Unfortunately this seems to be too much for my computer when run on the full dataset but it's so great to see how it can be done, thank you! – Jess Dec 21 '22 at 13:16
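Since the full dataset was too heavy for the `outer()` approach, an aside not taken from either answer: if the dive intervals are sorted by start and do not overlap (as consecutive dives should be), base R's `findInterval()` handles thousands of rows in a single vectorized pass. A sketch under that assumption:

```r
# assumes df1 is sorted by start and the intervals do not overlap
df1 <- data.frame(diveno = 1:5,
                  start = as.POSIXct(c("2018-08-01 08:20:40", "2018-08-01 08:40:50",
                                       "2018-08-01 10:01:00", "2018-08-01 15:45:30",
                                       "2018-08-01 17:06:00"), tz = "CET"),
                  fin   = as.POSIXct(c("2018-08-01 08:39:20", "2018-08-01 08:53:40",
                                       "2018-08-01 10:16:30", "2018-08-01 15:58:20",
                                       "2018-08-01 17:18:20"), tz = "CET"))
df2 <- data.frame(date = as.POSIXct(c("2018-08-01 08:30:00", "2018-08-01 15:47:00",
                                      "2018-08-02 17:10:00"), tz = "CET"))

# for each date, the index of the last interval whose start is <= date (0 if none)
pos <- findInterval(as.numeric(df2$date), as.numeric(df1$start))
# a date is inside that dive only if it also precedes the dive's end
hit <- pos >= 1 & df2$date <= df1$fin[pmax(pos, 1)]
df1$dateoccurs <- ifelse(seq_len(nrow(df1)) %in% pos[hit], "Y", "N")
df1$dateoccurs
# [1] "Y" "N" "N" "Y" "N"
```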