
I have a million+ row tibble of time series with a number of different IDs and tens of thousands of datapoints per ID.

timeseries <- tibble(ID = c(101, 101, 101, 101, 101), 
                     time = c(1,2,3,4,5), 
                     block = c(0,0,0,0,0))

I have another tibble of some thousands of rows that contains start and end times of events, different for each ID, that should be marked on the time series (so that they can be easily summarized with summarize()). There are empty timepoints between events, and at the start and end of each ID's time series.

blocks <- tibble(ID = c(101, 101), 
                 block = c(1, 2), 
                 st = c(1, 4), 
                 end = c(2,5))

What is the easiest and fastest way to do this?

My current solution is horribly slow and clunky:

j <- 1
for (i in 1:nrow(blocks)) {
  checkrow <- blocks[i, ]
  while (timeseries[j, "ID"] < checkrow["ID"]) j <- j + 1    # skip wrong ID
  while (timeseries[j, "time"] < checkrow["st"]) j <- j + 1  # skip timepoints until start
  while (timeseries[j, "time"] < checkrow["end"]) {
    timeseries[j, "block"] <- checkrow["block"]              # mark timepoints until end
    j <- j + 1
  }
}
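For a single ID, the while loops above can be replaced by one vectorized findInterval() lookup. This is a sketch, not the original code: mark_blocks is a hypothetical helper, and it assumes the blocks are sorted by start time and non-overlapping within each ID.

```r
# Hypothetical helper: vectorized block marking for one ID's timepoints.
# Assumes st is sorted increasing and blocks don't overlap.
mark_blocks <- function(times, st, end, block) {
  idx <- findInterval(times, st)    # index of the last start time <= each timepoint; 0 = before first block
  out <- rep(NA_real_, length(times))
  inside <- idx > 0 & times <= end[pmax(idx, 1)]  # pmax() guards against idx == 0
  out[inside] <- block[idx[inside]]
  out
}

mark_blocks(c(1, 2, 3, 4, 5), st = c(1, 4), end = c(2, 5), block = c(1, 2))
# with the example data this returns 1 1 NA 2 2
```

Applied per ID (e.g. inside group_by(ID) %>% mutate()), this avoids row-by-row iteration entirely.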

I don't have the start and end points in the time series with NAs in between, and I don't know how to create that, so this and this solution don't help.

I'd like to stay within tidyverse and vector logic instead of loops but don't know how. I looked at map() but couldn't figure out how to do this. I'm sure I'm missing some simple answer.


edit: So, I made a better version once I wasn't so tired. Using base R subsetting instead of while loops was much, much faster. I first pivoted blocks to long format, then made an empty timeseries_complete, and then:

for (j in unique(blocks$ID)) {
  bl <- blocks %>% filter(ID == j)   # blocks already pivoted to long format
  ts <- timeseries %>%
    filter(ID == j) %>%
    mutate(bl_nr = NA_integer_)

  for (i in 1:(nrow(bl) - 1)) {
    ts[ts$time > bl[[i, "Time"]] & ts$time < bl[[i + 1, "Time"]], "bl_nr"] <- bl[[i, "block_nr"]]
  }

  timeseries_complete <- timeseries_complete %>% add_case(ts)
}

This effectively solved the practical problem, but I'd still like to know a tidyverse version.
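A fully vectorized tidyverse version is possible with the non-equi join helpers added in dplyr 1.1.0. This is a sketch assuming that version is available and that blocks within an ID don't overlap; join_by(ID, between(time, st, end)) matches each timepoint to the block whose interval covers it, so it also works for non-discrete times.

```r
library(dplyr)  # requires dplyr >= 1.1.0 for join_by()

timeseries <- tibble(ID = c(101, 101, 101, 101, 101),
                     time = c(1, 2, 3, 4, 5),
                     block = c(0, 0, 0, 0, 0))
blocks <- tibble(ID = c(101, 101),
                 block = c(1, 2),
                 st = c(1, 4),
                 end = c(2, 5))

marked <- timeseries %>%
  select(-block) %>%
  # non-equi join: keep rows where ID matches and st <= time <= end
  left_join(blocks, by = join_by(ID, between(time, st, end))) %>%
  select(-st, -end)

marked
# with the example data, block is 1, 1, NA, 3rd row unmatched, then 2, 2
```

Timepoints outside every block get NA in block, which can then be handled (or filtered) before summarize().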

RandomMonitor

1 Answer


I am not entirely sure if I understand what you are trying to achieve. Does this help?

library(powerjoin)
library(dplyr)
timeseries |> 
  select(-block) |> 
  left_join(blocks |> select(-end), by = c("ID", "time" = "st")) |> 
  power_left_join(blocks |> select(-st), by = c("ID", "time" = "end"), conflict = coalesce_xy)

# A tibble: 5 × 3
     ID  time block
  <dbl> <dbl> <dbl>
1   101     1     1
2   101     2     1
3   101     3    NA
4   101     4     2
5   101     5     2
Julian
  • Yes, this is definitely something like what I need! Unfortunately it doesn't work right away. The actual times are not discrete and they originate from two different systems, so they are more like 2.4534 seconds and 2.4552 seconds, so the left_join doesn't work. I looked at the powerjoin, and it has a fuzzy join which appears to offer a solution. However, even with a reduced dataset (got rid of timepoints not part of any block) I ran into this: > power_left_join(., blocks %>% select(ID, st, block), by = c("ID", ~.x$time > .y$st)) Error: cannot allocate vector of size 41.0 Gb – RandomMonitor May 19 '23 at 19:38
  • You may Look Into data.table or arrow for speed. – Julian May 20 '23 at 10:56
  • You might round the times in new columns before the join and use those as keys – moodymudskipper Jul 04 '23 at 13:50
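Following the data.table suggestion in the comments, an interval match can be written as a non-equi join, which is typically much more memory-efficient than a fuzzy cross join. This is a sketch using the question's column names, assuming blocks' intervals don't overlap within an ID:

```r
library(data.table)

ts_dt <- data.table(ID = rep(101, 5), time = 1:5)
bl_dt <- data.table(ID = c(101, 101), block = c(1, 2), st = c(1, 4), end = c(2, 5))

# non-equi join: for each timeseries row, pick the block whose [st, end] covers time
ts_dt[, block := bl_dt[ts_dt, on = .(ID, st <= time, end >= time), x.block]]

ts_dt
# block column: 1 1 NA 2 2 for the example data
```

Unlike a fuzzy join, the non-equi join never materializes all candidate pairs, so it should stay within memory even for a million-row table.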