0

I need to join two datasets together:

  1. time series data taken every second with a heart rate measure (with 1.4 million records)
>allsecsHR
timestamp               HRbpm
2023-03-22 09:04:53     101
2023-03-22 09:04:54     124
2023-03-22 09:04:55     103
2023-03-22 09:04:56     111
2023-03-22 09:04:57     112
2023-03-22 09:04:58     143
2023-03-22 09:04:59     109
2023-03-22 09:05:00     129
2023-03-22 09:05:01     122
2023-03-22 09:05:02     125
2023-03-22 09:05:03     110
  1. behavioural records with a start and end timestamp
>bhr
Behaviour  BhrTimeStart         BhrTimeEnd
Forage     2023-03-22 09:04:53  2023-03-22 09:04:58
Vigilance  2023-03-22 09:04:58  2023-03-22 09:05:03
Forage     2023-03-22 09:05:03  2023-03-22 09:05:10

At the end I would like a dataset that has each row representing one second, with the behaviour that was being performed per second

I have tried to do this in mySQL and in R using sqldf and powerjoin but it keeps losing connection to the sql server after running for 2 hours or in R it does not complete and crashes or returns Error: vector memory exhausted (limit reached?). I would really appreciate some help to find an efficient way to do this!

I think a data.table solution might be the fastest but I am not sure how to do it, a tidy solution would also be great!

So far I have tried:

mySQL/sqldf

library(sqldf)
sqldf("select * from allsecsHR
              left join bhr
              on allsecsHR.timestamp between bhr.BhrTimeStart and bhr.BhrTimeEnd")

powerjoin

library(powerjoin)
power_left_join(
  allsecsHR, bhr, 
  by = ~.x$timestamp > .y$BhrTimeStart & 
    (.x$timestamp < .y$BhrTimeEnd | is.na(.y$BhrTimeEnd)),
  keep = "left")

The output table I would like is this:

timestamp           HRbpm   Bhr         BhrTimeStart        BhrTimeEnd
2023-03-22 09:04:53 101     Forage      2023-03-22 09:04:53   2023-03-22 09:04:58
2023-03-22 09:04:54 124     Forage      2023-03-22 09:04:53   2023-03-22 09:04:58
2023-03-22 09:04:55 103     Forage      2023-03-22 09:04:53   2023-03-22 09:04:58
2023-03-22 09:04:56 111     Forage      2023-03-22 09:04:53   2023-03-22 09:04:58
2023-03-22 09:04:57 112     Forage      2023-03-22 09:04:53   2023-03-22 09:04:58
2023-03-22 09:04:58 143     Vigilance   2023-03-22 09:04:58   2023-03-22 09:05:03
2023-03-22 09:04:59 109     Vigilance   2023-03-22 09:04:58   2023-03-22 09:05:03
2023-03-22 09:05:00 129     Vigilance   2023-03-22 09:04:58   2023-03-22 09:05:03
2023-03-22 09:05:01 122     Vigilance   2023-03-22 09:04:58   2023-03-22 09:05:03
2023-03-22 09:05:02 125     Vigilance   2023-03-22 09:04:58   2023-03-22 09:05:03
2023-03-22 09:05:03 110     Forage      2023-03-22 09:05:03   2023-03-22 09:05:10
isabeld
  • 13
  • 2
  • Seems similar to https://stackoverflow.com/questions/24480031/overlap-join-with-start-and-end-positions. Check out the proposed solutions there. – MrFlick Aug 03 '23 at 14:03
  • 3
    (1) Please share unambiguous data using `dput(x)`; it's possible to copy from the web interface, but the embedded spaces makes it a bit more difficult to use quickly. See https://stackoverflow.com/q/5963269 for demos of `dput`, `data.frame`, and `read.table` for this purpose. (2) "Join on a range" is natively supported in `data.table` (you tagged this, no obvious use of it) and `dplyr` (version 1.1.0 and newer) and `sqldf` (always) and `fuzzyjoin` (always). – r2evans Aug 03 '23 at 14:04
  • thanks for the suggestion, but I do not have a matching index value across the two tables so need to join it on timestamp range alone – isabeld Aug 03 '23 at 14:07
  • You can probably just expand the start/end data frame using `sequence`, I'll attempt a solution. – NicChr Aug 03 '23 at 14:14
  • How come timestamp 2023-03-22 09:04:57 has HRbpm of 112? How can this be inferred from the input data? – s_baldur Aug 03 '23 at 14:26

3 Answers3

1

If all timestamps in allsecsHR have a corresponding interval in bhr:

library(data.table)

setDT(bhr)[setDT(allsecsHR)[, ts := timestamp], on = .(BhrTimeStart == ts), roll = TRUE]
#>           Bhr        BhrTimeStart          BhrTimeEnd           timestamp HRbpm
#>  1:    Forage 2023-03-22 09:04:53 2023-03-22 09:04:58 2023-03-22 09:04:53   101
#>  2:    Forage 2023-03-22 09:04:54 2023-03-22 09:04:58 2023-03-22 09:04:54   124
#>  3:    Forage 2023-03-22 09:04:55 2023-03-22 09:04:58 2023-03-22 09:04:55   103
#>  4:    Forage 2023-03-22 09:04:56 2023-03-22 09:04:58 2023-03-22 09:04:56   111
#>  5:    Forage 2023-03-22 09:04:57 2023-03-22 09:04:58 2023-03-22 09:04:57   112
#>  6: Vigilance 2023-03-22 09:04:58 2023-03-22 09:05:03 2023-03-22 09:04:58   143
#>  7: Vigilance 2023-03-22 09:04:59 2023-03-22 09:05:03 2023-03-22 09:04:59   109
#>  8: Vigilance 2023-03-22 09:05:00 2023-03-22 09:05:03 2023-03-22 09:05:00   129
#>  9: Vigilance 2023-03-22 09:05:01 2023-03-22 09:05:03 2023-03-22 09:05:01   122
#> 10: Vigilance 2023-03-22 09:05:02 2023-03-22 09:05:03 2023-03-22 09:05:02   125
#> 11:    Forage 2023-03-22 09:05:03 2023-03-22 09:05:10 2023-03-22 09:05:03   110

If not all timestamps in allsecsHR have a corresponding interval in bhr, a non-equi join will work:

setDT(bhr)[setDT(allsecsHR)[, ts := timestamp], on = .(BhrTimeStart <= ts, BhrTimeEnd > ts)]

Data:

allsecsHR <- structure(list(timestamp = structure(c(1679490293, 1679490294, 
1679490295, 1679490296, 1679490297, 1679490298, 1679490299, 1679490300, 
1679490301, 1679490302, 1679490303), class = c("POSIXct", "POSIXt"
), tzone = ""), HRbpm = c(101L, 124L, 103L, 111L, 112L, 143L, 
109L, 129L, 122L, 125L, 110L)), row.names = c(NA, -11L), class = c("data.table", "data.frame"))

bhr <- structure(list(Bhr = c("Forage", "Vigilance", "Forage"), BhrTimeStart = structure(c(1679490293, 
1679490298, 1679490303), class = c("POSIXct", "POSIXt"), tzone = ""), 
    BhrTimeEnd = structure(c(1679490298, 1679490303, 1679490310
    ), class = c("POSIXct", "POSIXt"), tzone = "")), row.names = c(NA, 
-3L), class = c("data.table", "data.frame"))
jblood94
  • 10,340
  • 1
  • 10
  • 15
  • Thanks so much for the suggestion, but there are gaps in the bhr dataset as there are multiple recording periods – isabeld Aug 03 '23 at 14:58
  • I've updated the answer to clarify that by "gaps", I mean no timestamps in `allsecsHR` that do not have a corresponding interval in `bhr`. – jblood94 Aug 03 '23 at 15:01
  • thanks for the clarification, but I think that it will still not work as there are some timestamps in `allsecsHR` that do not have a corresponding interval in `bhr` as there were not behavioural observations for the whole time that heart rate was recorded – isabeld Aug 03 '23 at 15:10
  • In that case, you can use a non-equi join. See the updated answer. – jblood94 Aug 03 '23 at 20:24
  • thanks, I tried it but got the error `unused arguments` for the `on` part of the script? – isabeld Aug 04 '23 at 11:43
  • Are `allsecsHR` and `bhr` both `data.table` objects? Can you provide the full error message? – jblood94 Aug 04 '23 at 11:55
  • 1
    Ah yes, that was the problem! It works now :) the only issue is now that the `BhrTimeStart` and `BhrTimeEnd` column values are the same as the `timestamp` values rather than the original values. Not a huge problem, but makes it more difficult to check if the join has worked correctly – isabeld Aug 04 '23 at 12:24
  • Good. Sorry for my poor assumption based on the `data.table` tag in your question. I've updated the answer to include `setDT`. – jblood94 Aug 04 '23 at 12:28
0

There's likely a better solution without expansion but would something like this work?

library(tidyverse)
library(lubridate)

df1 <- tibble(timestamp = 
                ymd_hms("2023-03-22 09:04:53") + 
                dseconds(0:3),
              HRbpm = c(101, 124, 103, 111))
df1                          

df2 <- tibble(Behaviour = c("Forage", "Vigilance", "Forage"),
              BhrTimeStart = ymd_hms(c("2023-03-22 09:04:53",
                                       "2023-03-22 09:04:58",
                                       "2023-03-22 09:05:03")),
              BhrTimeEnd = ymd_hms(c("2023-03-22 09:04:58",
                                     "2023-03-22 09:05:03",
                                     "2023-03-22 09:05:10")))

df2$size <- as.double(df2$BhrTimeEnd - df2$BhrTimeStart) + 1

df3 <- tibble(behaviour = rep(df2$Behaviour, df2$size),
              timestamp = as_datetime(
                sequence(
                  df2$size, 
                  from = as.double(df2$BhrTimeStart), 
                  by = 1
                )
              ))

df3$BhrTimeStart <- rep(df2$BhrTimeStart, df2$size)
df3$BhrTimeEnd <- rep(df2$BhrTimeEnd, df2$size)

df1 %>%
  left_join(df3, by = "timestamp")

You can replace df1 and df2 with your datasets.

NicChr
  • 858
  • 1
  • 9
  • thanks so much for the suggestion! I am trying it but am getting this error "Error in as.POSIXct(origin, tz = "GMT", ...) : object 'origin' not found" when creating the df3 tibble. I am not quite sure on the issue here? – isabeld Aug 03 '23 at 14:52
  • Origin should be part of the `lubridate` package, I've updated the code which should hopefully work. – NicChr Aug 03 '23 at 18:27
  • thanks for the update, I tried it but ended up with a lot of duplicates in the `timestamp` column – isabeld Aug 04 '23 at 11:44
0

That might work:

library(dplyr)
library(lubridate)



# Sample data for allsecsHR
allsecsHR <- data.frame(
  timestamp = ymd_hms(c("2023-03-22 09:04:53", "2023-03-22 09:04:54", "2023-03-22 09:04:55",
                        "2023-03-22 09:04:56", "2023-03-22 09:04:57", "2023-03-22 09:04:58",
                        "2023-03-22 09:04:59", "2023-03-22 09:05:00", "2023-03-22 09:05:01",
                        "2023-03-22 09:05:02", "2023-03-22 09:05:03", "2023-07-22 09:05:03"
                        )),
  HRbpm = c(101, 124, 103, 111, 112, 143, 109, 129, 122, 125, 110, 202)
)


# Sample data for bhr
bhr <- data.frame(
  Behaviour = c("Forage", "Vigilance", "Keks"),
  BhrTimeStart = ymd_hms(c("2023-03-22 09:04:53", "2023-03-22 09:04:58", "2023-03-22 09:05:03")),
  BhrTimeEnd = ymd_hms(c("2023-03-22 09:04:58", "2023-03-22 09:05:03", "2023-03-22 09:05:10"))
)


# Function to check if timestamp is within BhrTimeStart and BhrTimeEnd
check_within_interval <- function(timestamp) {
  idx <- which(timestamp >= bhr$BhrTimeStart & timestamp < bhr$BhrTimeEnd)
  if (length(idx) > 0) {
    return(bhr$Behaviour[idx[1]])
  } else {
    return(NA)
  }
}

# Adding a new column 'Behaviour' to allsecsHR based on the condition
allsecsHR$Behaviour <- sapply(allsecsHR$timestamp, check_within_interval)

print(allsecsHR)

Gives me the following output:

> print(allsecsHR)
             timestamp HRbpm Behaviour
1  2023-03-22 09:04:53   101    Forage
2  2023-03-22 09:04:54   124    Forage
3  2023-03-22 09:04:55   103    Forage
4  2023-03-22 09:04:56   111    Forage
5  2023-03-22 09:04:57   112    Forage
6  2023-03-22 09:04:58   143 Vigilance
7  2023-03-22 09:04:59   109 Vigilance
8  2023-03-22 09:05:00   129 Vigilance
9  2023-03-22 09:05:01   122 Vigilance
10 2023-03-22 09:05:02   125 Vigilance
11 2023-03-22 09:05:03   110      Keks
12 2023-07-22 09:05:03   202      <NA>

Please note that I added a date for allsecsHR that is not represented in bhr to show handling of NA. And that Keks is another behaviour to show handling of < values.

Sascha
  • 194
  • 7