POSIX Vector Comparison -- searching through and finding a match between DATE vectors efficiently?

Question

I have five POSIXct vectors. ptime is the reference vector. I want to find matching dates between ptime and the rest of the vectors. Once a date is matched, I want to compare the times, and the result is stored in a data.frame (test) as a classifying number.

# create the reference and the other vectors 
ptime <- sample(seq(as.POSIXct('2005-08-01'),as.POSIXct('2006-05-31'), by='hour'),1051)
dawn <- sample(seq(as.POSIXct('2005-01-01'),as.POSIXct('2007-12-31'),by='hour'),1095)
sunrise <- sample(seq(as.POSIXct('2005-01-01'),as.POSIXct('2007-12-31'),by='hour'),1095)
sunset <- sample(seq(as.POSIXct('2005-01-01'),as.POSIXct('2007-12-31'),by='hour'),1095)
dusk <- sample(seq(as.POSIXct('2005-01-01'),as.POSIXct('2007-12-31'),by='hour'),1095)

# extract the date to compare using only the `dawn` vector
# all other vectors (except ptime) have the same date and length
pt <- as.Date(ptime)
dw <- as.Date(dawn)

# create data.frame
time <- c(1:1051)
test<-data.frame(time)

# I use a data.frame because I want to re-populate an existing data.frame
> str(test)
'data.frame':   1051 obs. of  1 variable:
 $ time: int  1 2 3 4 5 6 7 8 9 10 ...

# this is the loop that matches and assigns
for( b in 1:length(ptime) ){
    for( a in 1:length(dawn) ) {
      if( dw[a] == pt[b] ){
            if( ptime[b] < dawn[a] ) {
                test$time[b] <- 1
            }else if( ptime[b] < sunrise[a] ) {
                test$time[b] <- 2
            }else if( ptime[b] < sunset[a] ) {
                test$time[b] <- 3
            }else if( ptime[b] < dusk[a] ) {
                test$time[b] <- 4
            }else
                test$time[b] <- 1
        }
    }
}

# output result shows the categorization sequence of 1, 2, 3, and 4
> head(test)
  time
1    1
2    1
3    3
4    1
5    1
6    3

The above code accomplishes what I want to do... but it takes 98.58 seconds. I have more data that varies in length (up to 5000).

Since I am a newbie to this, my guess is that the comparison of the DATES is what takes so much time. Every time a new comparison dw[a] == pt[b] has to be made, the process must search through all of dw. Also, are the if-else statements necessary to accomplish the task?

Can anyone provide a faster/more efficient method to loop through, find matches, and store the results? Greatly appreciate it. Thanks

Your code is not reproducible, in other words I get errors when I run it on my machine. Please make your example reproducible. — Andrie, Aug 27 '11 at 15:00
Please try again. To ensure your example is reproducible, start from a clean R session and try to run your code. — Andrie, Aug 27 '11 at 18:39

joran · Answer 1 · 2011-08-27T20:33:34.050

3

Edited based on OP's updates

What follows is still mainly guesswork on my part. I fixed some typos in your edit to get this:

ptime <- sample(seq(as.POSIXct('2005-08-01'),as.POSIXct('2006-05-31'), 
                by='hour'),1051)
dawn <- sample(seq(as.POSIXct('2005-01-01'),as.POSIXct('2007-12-31'),
                by='hour'),1095)
sunrise <- sample(seq(as.POSIXct('2005-01-01'),as.POSIXct('2007-12-31'),
                by='hour'),1095)
sunset <- sample(seq(as.POSIXct('2005-01-01'),as.POSIXct('2007-12-31'),
                by='hour'),1095)
dusk <- sample(seq(as.POSIXct('2005-01-01'),as.POSIXct('2007-12-31'),
                by='hour'),1095)

# extract the date to compare using only the `dawn` vector
# all other vectors (except ptime) have the same date and length
pt <- as.Date(ptime)
dw <- as.Date(dawn)

# create data.frame
time <- c(1:1051)
test<-data.frame(time)

Here's my wild stab at this:

tmp <- outer(pt, dw, "==")
tmp[upper.tri(tmp)] <- NA
tmp <- which(tmp,arr.ind = TRUE)

test$time[ tmp[ ptime[ tmp[,1] ] < dawn[ tmp[,2] ],1] ] <- 1
test$time[ tmp[ ptime[ tmp[,1] ] < sunrise[ tmp[,2] ],1 ] ] <- 2
test$time[ tmp[ ptime[ tmp[,1] ] < sunset[ tmp[,2] ],1 ] ] <- 3
test$time[ tmp[ ptime[ tmp[,1] ] < dusk[ tmp[,2] ],1] ] <- 4

That's some ugly, ugly subset indexing going on there. Ugly enough that I'm convinced there has to be a better way to organize your data to avoid this. It's also obscure enough that I'm not sure I can clearly explain what's going on, but I think this is doing what you describe.
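
To make the indexing above easier to follow, here is a minimal sketch on toy data. The vectors `pt_small` and `dw_small` are made up for illustration and are not from the question. `outer(..., "==")` builds a logical matrix of every pairwise date comparison, and `which(..., arr.ind = TRUE)` turns the `TRUE` cells into (row, col) index pairs:

```r
# Toy dates, hypothetical, for illustration only
pt_small <- as.Date(c("2005-08-01", "2005-08-03", "2005-08-02"))
dw_small <- as.Date(c("2005-08-02", "2005-08-01", "2005-08-04", "2005-08-03"))

# eq[i, j] is TRUE when pt_small[i] and dw_small[j] fall on the same day
eq <- outer(pt_small, dw_small, "==")

# One row per matching pair, listed column by column:
# rows 3, 1, 2 of pt_small pair with cols 1, 2, 4 of dw_small
idx <- which(eq, arr.ind = TRUE)

# idx[, "row"] indexes the pt side, idx[, "col"] indexes the dw side
matched <- data.frame(pt = pt_small[idx[, "row"]],
                      dw = dw_small[idx[, "col"]])
```

Two things follow from this. The pairs come back in column order, not in the order of `pt`, so any result has to be written back through the row index. And because each of the four assignments above overwrites earlier labels wherever its own condition holds, their order matters: if dawn always precedes dusk on a given day, an event before dawn also satisfies the last condition and ends up labelled 4.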

edited Aug 27 '11 at 20:33

answered Aug 27 '11 at 00:51

joran


Sorry for not posting reproducible data but it seems like you have done it quite simply. The only thing I would have done differently would have been to offset the dates `x <- sample(seq(as.POSIXct('2000-08-01'),as.POSIXct('2003-12-31'),by = "hour"),1000)` and `y <- sample(seq(as.POSIXct('2000-01-01'),as.POSIXct('2005-12-31'),by = "hour"),5000)` and create different lengths. The resulting data is stored into a dataframe `test$time`, what other 'convenient' form would you recommend? I'm still learning and I greatly appreciate you sharing your knowledge – wisfool Aug 27 '11 at 01:43
Also... sunrise, sunset, etc. are POSIXct date/time vectors as well with the same DATE as `dawn` – wisfool Aug 27 '11 at 02:11
joran... running the `if-else` statement you posted throws an error... `In if (x[d[, 1]] < y[d[, 2]]) { : the condition has length > 1 and only the first element will be used` – wisfool Aug 27 '11 at 02:23
@wisfool That's why I said the code I provided was only a _sketch_. If you want more help, you'll have to edit your question to include a self-contained, reproducible example. See [here](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for some advice on how to do this. – joran Aug 27 '11 at 02:40
@joran The sample data you provided is exactly the same type of data I am working with `POSIXct` except the years are 2005-2007 for `x` and 2005-2006 for `y` with the above given lengths. `x <- sample(seq(as.POSIXct( '2005-01- 01'), as.POSIXct('2007-12-31')),1095)`, `y <- sample(seq (as.POSIXct('2005-01-01'),as.POSIXct('2006-16-31'),1051)`, `z <- sample(seq(as.POSIXct( '2005-01- 01'), as.POSIXct('2007-12-31')) ,1095)`. Everything you have done is in direct comparison to my data. You reproduced the problem I was having. – wisfool Aug 27 '11 at 03:00
@joran...continued... I don't understand what you mean by 'reproducible example' when you reproduced exactly what I have (with minor details). Your method is extremely fast (0.96 !!) !! I had to adjust the `if-else` statement due to the above mentioned error but other than that I thank you greatly for your help. I'll be dissecting your method for days until I understand the logic. Thanks again! – wisfool Aug 27 '11 at 03:05
@joran... basically what this is trying to do is take a given event with a timestamp (`ptime`) and label it when it occurred i.e. dawn, sunrise, sunset, dusk, or night. The data for dawn, sunrise, etc. comprises of a starting timestamp, when it happens (for example, dawn happens at '2005-01-01 04:25:00'). In order to label `ptime` i first match dates of occurrence. Then I check the time of `ptime` and compare it with the other times to label it appropriately. If `ptime` happened before 'dawn', then `ptime` is labeled night, the `ptime` event was recorded during the night. – wisfool Aug 27 '11 at 21:55
@joran... Thank you for your attention and help. With the help of your code I was able to accomplish the task extremely quickly... this is what I have... `d <- which(outer(as.Date(ptime, tz='MST'),as.Date(dawn, tz='MST'),"=="),arr.ind = TRUE) test$time <- ifelse( (ptime[d[,1]] < dawn[d[,2]]) | (ptime[d[,1]] > dusk[d[,2]]), 1, ifelse( ptime[d[,1]] < sunrise[d[,2]], 2, ifelse( ptime[d[,1]] < sunset[d[,2]], 3, 4 ) ) )` – wisfool Aug 27 '11 at 21:58
@wisfool: please post this solution as an answer. It's OK, in fact encouraged, to answer your own question if you come up with a good solution. (Much easier for future readers to find than something at the bottom of a long comment thread ...) – Ben Bolker Aug 28 '11 at 02:13

score 2 · Answer 2 · answered Aug 28 '11 at 02:51

Real fast solution

ptime <- sample(seq(as.POSIXct('2005-08-01'),as.POSIXct('2006-05-31'), by='hour'),1051)
dawn <- sample(seq(as.POSIXct('2005-01-01'),as.POSIXct('2007-12-31'), by='hour'),1095)
sunrise <- sample(seq(as.POSIXct('2005-01-01'),as.POSIXct('2007-12-31'), by='hour'),1095)
sunset <- sample(seq(as.POSIXct('2005-01-01'),as.POSIXct('2007-12-31'), by='hour'),1095)
dusk <- sample(seq(as.POSIXct('2005-01-01'),as.POSIXct('2007-12-31'), by='hour'),1095)

time <- c(1:1051)
test<-data.frame(time)

# From joran
#creates a matrix that lists the IDs that match each other
d <- which(outer(as.Date(ptime, tz='MST'),as.Date(dawn, tz='MST'),"=="),arr.ind = TRUE)

>head(d)
     row col
[1,]  86 213
[2,] 226 213
[3,] 346 213
[4,] 492 214
[5,] 272 215

# This nested `ifelse` handles whole vectors at once.
# The rows of d are not in ptime order, so write back through the row index d[,1].
test$time[d[,1]] <- ifelse( (ptime[d[,1]] < dawn[d[,2]]) | (ptime[d[,1]] > dusk[d[,2]]), 1, 
                    ifelse( ptime[d[,1]] < sunrise[d[,2]], 2, 
                    ifelse( ptime[d[,1]] < sunset[d[,2]], 3, 4 ) ) )

Thanks to joran, this runs in 0.00 seconds on my machine. Vectorization is the key.
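
As a rough illustration of the same idea, here is a small self-contained sketch with made-up data. It swaps `outer()` for `match()`, which only works under the assumption (not confirmed in the question) that each calendar day appears exactly once in `dawn`; the times of day and the UTC time zone are also invented for the example. The labelling rule is the one from the loop in the question:

```r
# Hypothetical data: one dawn/sunrise/sunset/dusk row per calendar day (UTC)
days    <- seq(as.Date("2005-08-01"), as.Date("2005-08-10"), by = "day")
day0    <- as.POSIXct(format(days), tz = "UTC")   # midnight of each day
dawn    <- day0 + 5  * 3600
sunrise <- day0 + 6  * 3600
sunset  <- day0 + 19 * 3600
dusk    <- day0 + 20 * 3600

# Events to label (made-up timestamps)
ptime <- as.POSIXct(c("2005-08-03 04:30:00", "2005-08-01 05:30:00",
                      "2005-08-07 12:00:00", "2005-08-02 19:30:00",
                      "2005-08-05 23:00:00"), tz = "UTC")

# For each event, the row of the matching day (NA if that day is absent)
k <- match(as.Date(ptime, tz = "UTC"), as.Date(dawn, tz = "UTC"))

# Same rule as the loop in the question, applied to whole vectors:
# 1 = night, 2 = dawn, 3 = day, 4 = dusk
label <- ifelse(ptime < dawn[k],    1,
         ifelse(ptime < sunrise[k], 2,
         ifelse(ptime < sunset[k],  3,
         ifelse(ptime < dusk[k],    4, 1))))

test <- data.frame(time = label)
```

Because `k` is already in the order of `ptime`, the labels line up with the rows of `test` without further indexing. If a date can occur more than once in `dawn`, `match()` returns only the first hit, so the `outer()`/`which()` form is the safer choice in that case.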
