1

After reading quite a few posts about this and trying the solutions for my data I still do not get the desired result. Basically I have a data.frame with two time columns: TimeStamp and StartTime and one column that acts as an identifier for the event: ID. I want to create a third column where if ID == -999 (representing non-event) then TimeStamp column is used otherwise StartTime should be used.

To make a bit more complicated: so, the data represents 'blocks' of events. Between the events there is time where 'nothing' happens i.e. ID == -999. In that case I want the column to be filled by the first element of that block.

Here is oversimplified version of my data:

ID <- rep(c(84,-999,88),c(3,3,4))
f <- as.POSIXct("09:55:45", format = "%H:%M:%S")
t <- as.POSIXct("10:05:45", format = "%H:%M:%S")
TimeStamp <- seq.POSIXt(from = f, to = t, length.out = 10)
StartTime <- as.POSIXct(rep(c("09:54:12",NA,"10:02:25"),c(3,3,4)), format = "%H:%M:%S")
df <- data.frame(TimeStamp,StartTime,ID)

This is the result I would like to have:

> df[,"Time"] <- rep(c("09:54","09:59","10:02"), c(3,3,4))
> df
#             TimeStamp           StartTime   ID  Time
#1  2015-04-13 09:55:45 2015-04-13 09:54:12   84 09:54
#2  2015-04-13 09:56:51 2015-04-13 09:54:12   84 09:54
#3  2015-04-13 09:57:58 2015-04-13 09:54:12   84 09:54
#4  2015-04-13 09:59:05                <NA> -999 09:59
#5  2015-04-13 10:00:11                <NA> -999 09:59
#6  2015-04-13 10:01:18                <NA> -999 09:59
#7  2015-04-13 10:02:25 2015-04-13 10:02:25   88 10:02
#8  2015-04-13 10:03:31 2015-04-13 10:02:25   88 10:02
#9  2015-04-13 10:04:38 2015-04-13 10:02:25   88 10:02
#10 2015-04-13 10:05:45 2015-04-13 10:02:25   88 10:02

I have tried things like ifelse and sapply. Its not really working. The solution I have up till now is by extracting all the non-events (ID == -999). Then with another ID variable that identifies all the unique events/non-events aggregate to find the first TimeStamp with min function. Then I have two different Time columns which I join using the solution in this post. It works, but I think there is a more elegant and direct way to do this. But being a R-newbie I can't solve it quite yet.

Any suggestions?

BTW: I hope its a bit clear, please tell me if its not.

EDIT: I don't think my question is a duplicate, because those answers did not work in my case. It doesn't solve the added complication I tried to explain above.

Community
  • 1
  • 1
tstev
  • 607
  • 1
  • 10
  • 20
  • 1
    So to be clear, if there a block of `-999` at the end of this data set, you want to consider it as a different block from the first one? If so, I think it would be worth to add this at the end so it will be clearer. – David Arenburg Apr 13 '15 at 09:59
  • Yes you are right that would make it more clear. Thanks for the suggestion. Another block of `-999` is indeed a separate event. – tstev Apr 13 '15 at 11:15

1 Answers1

4

Here's a possible solution using the data.table v >= 1.9.5 (I've added another -999 event at the end of the data set as I understand you want it to be treated differently).

Basically I'm just creating a new index using the new rleid function (and inserting it directly into the by statement) and then setting a simple if else statement

library(data.table)
setDT(df)[, Time := if(anyNA(StartTime)) {
                       format(TimeStamp[1L], "%H:%M") 
                      } else {
                       format(StartTime[1L], "%H:%M")
                      },
            by = rleid(ID)][]

#               TimeStamp           StartTime   ID  Time
#  1: 2015-04-13 09:55:45 2015-04-13 09:54:12   84 09:54
#  2: 2015-04-13 09:56:49 2015-04-13 09:54:12   84 09:54
#  3: 2015-04-13 09:57:54 2015-04-13 09:54:12   84 09:54
#  4: 2015-04-13 09:58:58                <NA> -999 09:58
#  5: 2015-04-13 10:00:03                <NA> -999 09:58
#  6: 2015-04-13 10:01:08                <NA> -999 09:58
#  7: 2015-04-13 10:02:12 2015-04-13 10:02:25   88 10:02
#  8: 2015-04-13 10:03:17 2015-04-13 10:02:25   88 10:02
#  9: 2015-04-13 10:04:21 2015-04-13 10:02:25   88 10:02
# 10: 2015-04-13 10:05:26 2015-04-13 10:02:25   88 10:02
# 11: 2015-04-13 10:06:31                <NA> -999 10:06
# 12: 2015-04-13 10:07:35                <NA> -999 10:06
# 13: 2015-04-13 10:08:40                <NA> -999 10:06
# 14: 2015-04-13 10:09:45                <NA> -999 10:06
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
  • 2
    You can pass `rleid(ID)` directly to `by`. I'd do it as: `setDT(df)[, Time := format(if(anyNA(StartTime)) TimeStamp[1L] else StartTime[1L], "%H:%M"), by=rleid(ID)]` – Arun Apr 13 '15 at 10:35
  • 1
    Or `by=cumsum(c(0,diff(dt$ID))!=0)` if he doesn't want to install 1.9.5. – nicola Apr 13 '15 at 10:45
  • 1
    @nicola, that works fine here, but gets complicated when id should be generated based on more than one column. – Arun Apr 13 '15 at 10:54
  • @david, @nicola -- thanks a lot! it works quite nicely. In the function i built in a check to see which version of `data.table` is installed and then either use David's answer or otherwise nicola's. – tstev Apr 14 '15 at 11:33
  • @David - i have another question that is kind of similar as this. let's say I have an `ID` column: `1,1,1,2,2,1,2,1,1,2,2`. for example. Now I want to create a new column that would results in: `0,0,0,1,1,0,2,0,0,3,3`. So a conditional run-length encoding where I am only interested in when `ID == 2`. – tstev Apr 14 '15 at 11:42
  • 1
    I don't really have time right now, but maybe something like `x <- c(1,1,1,2,2,1,2,1,1,2,2) ; y <- rleid(x) ; y[x != 2] <- 0 ; y/2` – David Arenburg Apr 14 '15 at 11:52
  • @DavidArenburg -- no problem! I had same as you, but it doesn't work for when the first element in `x` is `2`. Then you have to do to `ceiling(y/2)`. I can't seem to get this to work with `data.table` syntax for my orginal dataset. I will keep playing around. Thanks anyway! – tstev Apr 14 '15 at 12:26
  • 1
    I figured it out! I was thinking to difficult :P Cheers! – tstev Apr 14 '15 at 14:43