7

I'm analysing temporal patterns in a complex data set consisting of several environmental variables as well as activity data from various animal species. These data have been collected by multiple experimental setups, and data from each setup have been stored once per minute. The project has been running for several years now, so my data set is rather large.

The first few lines of one of my datasets look like this:

> head(setup_01)
DateTime                Film_number unused PIR Wheel Temperature LightOld LightDay LightNight LightUV IDnumbers    error mouse shrew vole rat frog rest extra_info odour
1 2015-03-10 12:27:10                  x   0       0       13.40  1471.34    -0.97    1331.29  700.42           no error     0     0    0   0    0    0                1
2 2015-03-10 12:28:10                  x   0       0       13.43  1471.38    -1.07    1291.11  731.32           no error     0     0    0   0    0    0                1
3 2015-03-10 12:29:10                  x   0       0       13.31  1471.24    -1.08    1368.57 1016.02           no error     0     0    0   0    0    0                1

As I want to relate these variables to various natural cycles like sunrise and sunset throughout the seasons, I have made use of the package maptools to calculate sunrise and sunset times

library(maptools)
gpclibPermit()

#set coordinates
crds=c(4.4900,52.1610)

# download the sunrise/sunset/etc data
setup_01$sunrise=sunriset(matrix(crds,nrow=1),dateTime=as.POSIXct(setup_01$DateTime),POSIXct.out=TRUE,direction="sunrise")
setup_01$sunset=sunriset(matrix(crds,nrow=1),dateTime=as.POSIXct(setup_01$DateTime),POSIXct.out=TRUE,direction="sunset")

#create a variable that's 0 except at sunrise, and one that's 0 except at sunset
setup_01$sunrise_act=0
setup_01$sunset_act=0
setup_01[abs(unclass(setup_01[,"DateTime"])-unclass(setup_01[,"sunrise"]$time))<30,]$sunrise_act=1
setup_01[abs(unclass(setup_01[,"DateTime"])-unclass(setup_01[,"sunset"]$time))<30,]$sunset_act=1

As the behaviour of most animals differs, depending on whether it is day or night, I have used sunset/sunrise times to data to calculate a new variable that's 0 at night and 1 at daytime:

#create a variable that's 0 at night and 1 at daytime
setup_01$daytime=0
setup_01[setup_01[,"DateTime"]>setup_01[,"sunrise"]$time & setup_01[,"DateTime"]<setup_01[,"sunset"]$time,]$daytime=1

So far, so good... it's even possible with maptools to use the start of civil/nautical/astronomical dusk and dawn instead of sunrise and sunset.

This, however, is where my problem starts. I want to number all the days in my experiment. And rather than increasing the day counter at midnight, as is usual and easy to do, I want to increase the day counter at sunset (or possibly in future experiments another moveable time of day like sunrise, nautical dusk and dawn,...). As sunset does not happen at the same time every day, it is not - for me - a straightforward problem to solve.

I've only come up with a for-loop, which is not a nice way of doing things. Also, given that I have more than 6 years worth of data points collected once a minute in several setups, I can go sit and watch the tectonic plates move while R runs through a whole bunch of loops like these:

setup_01$day=0
day<-1
for(i in 1:nrow(setup_01)){
    setup_01[i,]$day<-day
    if(setup_01[i,]$sunset_act==1){
        day<-day+1
    }
}

Apart from being ugly and slow, this code has one big problem: it does not deal with missing values. Sometimes, due to equipment failure, data have not been recorded at all for hours or days. If no data have been recorded during a sunset, the above code does not increase the day counter. This means I need to - somehow - incorporate the date/time codes as well. It is easy to create a variable of days since the start of the experiment:

setup_01$daynumber<-as.integer(ceiling(difftime(setup_01$DateTime, setup_01$DateTime[1], units = "days")))

Perhaps these numbers can be used, possibly in conjunction with Heroka's nice rle-algorithm.

I have used dput to make a few months worth of data from one setup, including a few large chunks of missing data, as well as the newly created variables (as described in this post and in Heroka's answer) available here.

I've looked for something better, nicer and especially faster, but have been unable to come up with a nice trick. I've fiddled with subsetting my dataframe, but come to the conclusion that it is probably a silly approach. I've looked at maptools, lubridate, and GeoLight. I've searched Google, Stack Overflow and various books, like Hadley Wickham's fantastic Advanced R. All to no avail. Perhaps I'm missing something obvious though. I'm hoping someone here can help me.

Community
  • 1
  • 1
Yuri Robbers
  • 291
  • 1
  • 10

2 Answers2

3

I came up with a solution on generated 0's and 1's (as you have already generated those), and it works with runlengths.

  #sunset/sunrise is series of 0's and 1's indicating night and daytime, so solution that works for random sequence
#will work for OP's dataset
set.seed(10)
sunset <- c(1,rbinom(20,1,0.5))

#counter needs to be x for sequence of 11111 (day) and 0000(night), and then increase when 0 reappears
#counter starts at 1

#intermediate step: number each half-day
rle_sunset <- rle(sunset)
period <- rep(1:length(rle_sunset$lengths),rle_sunset$lengths)
#calculate day so that each two subsequent periods are one day

day <- ceiling(period/2)

> cbind(sunset,period,day)
      sunset period day
 [1,]      1      1   1
 [2,]      1      1   1
 [3,]      0      2   1
 [4,]      0      2   1
 [5,]      1      3   2
 [6,]      0      4   2
 [7,]      0      4   2
 [8,]      0      4   2
 [9,]      0      4   2
[10,]      1      5   3
[11,]      0      6   3
[12,]      1      7   4
[13,]      1      7   4
[14,]      0      8   4
[15,]      1      9   5
[16,]      0     10   5
[17,]      0     10   5
[18,]      0     10   5
[19,]      0     10   5
[20,]      0     10   5
[21,]      1     11   6
Heroka
  • 12,889
  • 1
  • 28
  • 38
  • Thank you, Heroka! This does exactly what I need it to do, and is virtually instantaneous too... Thanks again! – Yuri Robbers Aug 11 '15 at 13:11
  • One additional niggly-naggly thing though, as I may have been too fast accepting this answer as perfect: this works absolutely fine if there are no missing data. As soon as missing data include t least one instance of a "1" for sunset, things go wrong... unfortunately I have some of those instances. I failed to mention this in the original post. My apologies. – Yuri Robbers Aug 11 '15 at 13:24
  • Can you edit your post to accomodate missing data? How are they generated? (Do you have missing dates?) Maybe you can provide a bit more of the (relevant!) data using dput? – Heroka Aug 11 '15 at 15:09
  • Thank you, Heroka. I have edited my post accordingly, and provided a link to a relevant subset of my data, provided with `dput`. – Yuri Robbers Aug 12 '15 at 12:40
1

I prefer a solution based on precalculated tables. This is slower but I find it clearer to understand. Then I use dplyr to arrange the info I need.

Let me show what I mean. For the sake of an example I create a list of sunset times. Of course you will need to calculate the actual ones.

library(dplyr)
n.obs=1000
set.seed(10)
t0 <- as.POSIXct('2015-03-08 18:00:00')
artificial.sunsets <- data.frame(num.day= seq(0,n.obs+35)) %>% mutate(sunset=cumsum(rlnorm(length(num.day))*30)+t0 + 24*3600*num.day)

artificial.sunsets contains the day number and exact time of sunset but may also include more information about the day.

And some artificial data:

t0 <- as.POSIXct('2015-03-10 12:27:10')
test.data <- data.frame(DateTime=t0+ seq(0, n.obs*24*3600, by=3600), observation=rnorm(24*n.obs+1))

Then one can find the previous sunset using:

find.sunset.before <- function(x){
  cbind(x,artificial.sunsets %>% filter(sunset < x$DateTime) %>% tail(.,n=1))
}

data.with.sunset=test.data %>% rowwise() %>% do(find.sunset.before(.)) %>% ungroup()%>% mutate(rel.time = DateTime-sunset)
head(data.with.sunset)

The resulting table would then contain three more columns 1) the corresponding day number 2) the corresponding sunset time, and 3) the time after sunset.

This should be robust against missing measurements as the day numbering occurs in another table. You can also easily modify the algorithm to use different times and even apply several.

Update

all this can be done much quicker using data.table:

library(data.table)
dt1 <- data.table(artificial.sunsets)
dt2 <- data.table(test.data)

dt1[,DateTime:=sunset]

setkey(dt1, DateTime)
setkey(dt2, DateTime)

r <- dt1[dt2,roll=TRUE]
r[,time.diff:=DateTime-sunset]

I tried timing it with system.time for 1000 observations - the previous takes about 1m, the data.table solution is 0.011s.

bdecaf
  • 4,652
  • 23
  • 44
  • Thank you, bdecaf. This is an elegant, easy to understand solution, and it works. The only drawback is its speed: it still takes over half an hour on the test dataset I've made available in my original post, which means it'll probably run for about 8-10 days on my full dataset. – Yuri Robbers Aug 20 '15 at 11:38
  • 1
    i see - it really shouldn't take this long. in this solution the find.sunset.before is the slowest part. I found this interesting question which might be related: http://stackoverflow.com/questions/11341557/join-data-table-on-exact-date-or-if-not-the-case-on-the-nearest-less-than-date and add it to the answer. This one is near instantanous on my notebook (which is really slow) – bdecaf Aug 21 '15 at 14:11
  • Thanks, @bdecaf. That looks really promising. Your timing of the old version doesn't seem far off of my results. Mind that I have 1440 observations per day, for about 6 years in 8 setups. That'd bring it to nearly two weeks. Your `data.table` version should bring this down to only a few minutes. I'm currently away for the weekend (I'm writing this on my phone), but I'll try the `data.table` solution no later than Monday. – Yuri Robbers Aug 22 '15 at 23:33
  • The result is astonishing. This gets me the result I want in next to no time at all. Thanks a lot, @bdecaf!! – Yuri Robbers Aug 24 '15 at 17:40